Learning to Cook – An Exploration of Recipe Data

Travis Arffa (tarffa), Rachel Lim (rachelim), and Jake Rachleff (jakerach)

Goals

We set out to solve two problems. First, we wanted to figure out the different “types” of recipes based purely on what ingredients were included, which would allow us to understand which ingredients are prevalent in which type of cuisine. Second, we wanted to predict recipe review scores based on recipe ingredients and real-valued features such as nutrition score and number of steps.

Data/Features

Data: We scraped all recipes currently on Epicurious.com for our data set. For each recipe, we scraped its ingredients, preparation steps, nutritional information, cook time, and user ratings (ranging 0-100). We filtered out recipes with fewer than 15 ratings, and collected 10,440 recipes in total. For clustering, we used all the recipe data. For prediction, we randomly partitioned the dataset into training and test sets constituting 80% and 20% of the recipes respectively.

Features: For Naive Bayes, clustering, and Random Forest, we used hand-curated binary feature vectors of ingredients in R^355. For linear regression we used simple features such as number of steps and number of ingredients, and we expanded the ingredient features for Naive Bayes as well.
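
The poster does not say which tools were used; as a rough, hypothetical illustration in Python, the binary ingredient features and the 80/20 split could be built along these lines (the recipe records and the ingredient vocabulary are assumptions, not the authors' code):

    import numpy as np
    from sklearn.model_selection import train_test_split

    def build_ingredient_features(recipes, vocabulary):
        """Encode each recipe as a binary vector over a fixed ingredient vocabulary."""
        index = {ing: i for i, ing in enumerate(vocabulary)}  # e.g. 355 curated ingredients
        X = np.zeros((len(recipes), len(vocabulary)))
        for row, recipe in enumerate(recipes):
            for ing in recipe["ingredients"]:
                if ing in index:
                    X[row, index[ing]] = 1.0
        return X

    # Hypothetical usage: `recipes` is a list of scraped dicts with "ingredients" and "rating".
    # X = build_ingredient_features(recipes, vocabulary)
    # y = np.array([r["rating"] for r in recipes])  # user ratings range 0-100
    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
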
(1) Clustering Recipes

Model: We sought to define a cuisine based solely on its ingredients, with no preconceived notions about cuisine itself. Thus, we found the unsupervised learning strategy of k-means clustering to be the best model for this task, which we could then verify against each recipe’s tags.
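
A minimal sketch of this step, assuming scikit-learn (the poster does not name a library): fit k-means for a range of k on the binary ingredient vectors and record the total squared error, then look for the inflection point.

    from sklearn.cluster import KMeans

    def cluster_errors(X, k_values):
        """Fit k-means for each k and return the total within-cluster squared error."""
        errors = {}
        for k in k_values:
            model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            errors[k] = model.inertia_  # sum of squared distances to assigned centroids
        return errors

    # e.g. cluster_errors(X, range(2, 11)); an elbow near k=3 suggests three clusters.
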

Results and discussion: Varying the number of clusters, we obtained a graph of total squared error [plot: total squared error vs. k, not reproduced here]. We see an inflection point around k=3, suggesting 3 as the optimal number of clusters.

The cluster plots [not reproduced here] show the clusters for k=3 and k=4. For k=3, inspecting the tags of recipes belonging to each cluster, we observe that these clusters correspond to meals, drinks, and desserts. We also observe an interesting trend: as we increased the number of clusters, these recipe classes were split further into natural subclasses. When k increases from 3 to 4, the cluster corresponding to ‘meals’ (in purple) is split into Asian and European cuisines. For each increase of k past the kink, we still discover new cuisine types with similar flavors based on their tags and ingredients, meaning that our most informative cluster sizes were not dependent on the inflection of the cluster error graph.

(2) Recipe Rating Prediction

To learn the quality of a recipe (measured by its rating), we tried several different machine learning models, including linear regression, locally weighted linear regression, Naive Bayes, and Random Forest. We discuss the latter two models in depth.

Naive Bayes

Model: Naive Bayes is a probabilistic model for classification that assumes the occurrence of features is conditionally independent given the class variable. While this assumption does not hold in the case of recipes, it is a good baseline model for prediction.

Data: We discretized the ratings into evenly-sized buckets and performed multiclass Naive Bayes classification, experimenting with the number of buckets and the feature type (an R^355 binary feature vector of ingredients, and an R^126025 binary feature vector of paired ingredients).

Results and discussion: Prediction works better with pairwise ingredient features. We expect this to be the case, since the conditional independence assumption does not hold for recipe ingredients, and ingredients tend to “go well together”.

Mean Absolute Error
                                  Number of buckets
                                    10      20
  Basic ingredient features        8.41    9.42
  Pairwise ingredient features     7.78    8.84
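
A hedged sketch of this setup (not necessarily the authors' implementation): discretize the ratings into equal-width buckets and fit a Bernoulli Naive Bayes over the binary ingredient or paired-ingredient features. The helpers below, and the reuse of X_train/y_train from the earlier sketch, are illustrative assumptions.

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    def bucketize(ratings, n_buckets=10):
        """Map ratings in [0, 100] to integer bucket labels of equal width."""
        edges = np.linspace(0, 100, n_buckets + 1)
        return np.digitize(ratings, edges[1:-1])  # labels 0 .. n_buckets-1

    def pairwise_features(X):
        """Binary indicators for co-occurring ingredient pairs (flattened outer product)."""
        # 355 x 355 pairs flattened -> the R^126025 vector mentioned on the poster.
        # Dense here for brevity; at ~10k recipes a sparse matrix is advisable.
        return np.einsum("ni,nj->nij", X, X).reshape(X.shape[0], -1)

    # clf = BernoulliNB().fit(pairwise_features(X_train), bucketize(y_train, n_buckets=10))
    # pred_buckets = clf.predict(pairwise_features(X_test))
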
Random Forest

Model: Random Forest uses randomized samples of data to fit several smaller regression trees. It outputs a prediction that is the average output of each tree. Randomization reduces correlation between trees, and the use of multiple trees counteracts overfitting. We chose RF due to the large number of ingredients and the potential to overfit to ingredients that occur frequently in the training data.

                      Full    Reduced
  Min. Leaf Size        50        5
  Num. Trees           100       10
  Num. Predictors      355       15
  Sub-Sample Size      200      200
  Abs. Test Error      6.07     6.08

Results: RF had an average absolute test error of around 6.08 and an MSE of 86.5. RF outperformed the other regression techniques that we attempted, including linear regression and locally weighted linear regression.

Discussion: The test errors of the two models indicate that additional features and tree complexity did not yield more accurate predictions. Fifteen predictors were chosen for the reduced model based on the out-of-bag variable importance, which equals the average difference between the outputs of trees that included the feature and those that did not. The sub-sample size was kept constant to account for the sparsity of the feature vectors.
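
The poster does not name its random-forest implementation; the sketch below is a rough scikit-learn analogue of the full and reduced configurations, with impurity-based importances standing in for out-of-bag variable importance. It reuses the hypothetical X_train/X_test split from the earlier sketch.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    # Full model: 100 trees, minimum leaf size 50, sub-samples of 200 recipes per tree.
    full_rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=50,
                                    max_samples=200, random_state=0).fit(X_train, y_train)

    # Reduced model: keep the 15 most important ingredient features, refit a smaller forest.
    top15 = np.argsort(full_rf.feature_importances_)[-15:]
    reduced_rf = RandomForestRegressor(n_estimators=10, min_samples_leaf=5,
                                       max_samples=200, random_state=0).fit(X_train[:, top15], y_train)

    print("full MAE:   ", mean_absolute_error(y_test, full_rf.predict(X_test)))
    print("reduced MAE:", mean_absolute_error(y_test, reduced_rf.predict(X_test[:, top15])))
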
Future Research

The next step for our project would be to auto-generate recipes using the clustered tags while striving to maximize ratings. This would represent a combination of the supervised and unsupervised techniques currently presented, as well as additional modeling to account for varying amounts of each ingredient. Another interesting avenue of research would be to look at how the ingredients, and the amounts of each ingredient, correspond to nutritional value.
