Predicting Movie Success Based on IMDb Data
Nithin Vr
National Institute of Technology Calicut
All content following this page was uploaded by Nithin Vr on 24 April 2019.
Abstract: Film studios in America produce several hundred movies every year, making the United States the third most
prolific producer of films in the world. The budgets of these movies are of the order of hundreds of millions of dollars, making
their box office success essential for the survival of the industry. Knowing before release which movies are likely to succeed and
which are likely to fail could benefit production houses greatly, as it would enable them to focus their advertising campaigns,
which themselves cost millions of dollars, accordingly. It could also help them decide when it is most appropriate to release a
movie by looking at the overall market. The prediction of movie success is therefore of great importance to the industry. Machine
learning algorithms are widely used to make predictions such as growth in the stock market, demand for products, and the nature
of tumors. This paper presents a detailed study of logistic regression, SVM regression, and linear regression on IMDb data to
predict movie box office revenue.
Keywords: Data Mining, Logistic Regression, SVM Regression, Linear Regression.
I. INTRODUCTION
In the United States of America, thousands of films are released every year. Since the 1920s, the American film industry has grossed
more money every year than that of any other country [1]. Cinema in America is a multi-billion dollar industry in which even
individual films earn over a billion dollars. Large production houses control most of the film industry, with billions of dollars
spent on advertisements alone. Advertising campaigns contribute heavily to the total budget of a movie, and sometimes the investment
results in heavy losses to the producers. Warner Brothers, one of the largest production houses, saw its revenues fall last year
despite inflation and an increased number of movies released. If it were possible to know beforehand how likely a movie is to
succeed, production houses could schedule their releases so as to gain maximum profit. They could use the predictions to know when
the market is dull and when it is not. This shows a clear need for such software. Many have tried to predict movie revenues, and
techniques such as social media sentiment analysis have been used in the past. None of the studies thus far has succeeded in
suggesting a model good enough to be used in the industry. In this study, we attempt to use IMDb data to predict the gross revenue
of movies as well as their IMDb rating. The paper is organized as follows. Section II describes the role of data set collection and
preprocessing in data mining. Regression methods and models are discussed in Section III. Results are shown in Section IV, and
Section V concludes the paper. The general design is shown in Figure 1.
B. Data Preprocessing
The data we obtained are highly susceptible to noise, missing values, and inconsistencies because of their large size and their
likely origin from multiple, heterogeneous sources [2]. We mainly used IMDb, Rotten Tomatoes, and Wikipedia. The main problem with
the datasets was missing fields. To overcome this problem, we filled each missing field with a measure of central tendency for the
attribute, using both the mean and the median. We then removed duplicate items.
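The cleaning step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the column names (`title`, `year`, `budget`), the helper names `impute_missing` and `drop_duplicates`, and the choice of title and year as the duplicate key are all assumptions made for the example. Imputation is applied before duplicate removal, following the order given in the text.

```python
from statistics import mean, median

def impute_missing(rows, column, strategy="median"):
    """Fill None values in `column` with the mean or median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [dict(r, **{column: fill}) if r[column] is None else dict(r) for r in rows]

def drop_duplicates(rows, key_columns):
    """Keep only the first row for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(r[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical records; budgets are in millions of dollars.
movies = [
    {"title": "A", "year": 2010, "budget": 100.0},
    {"title": "B", "year": 2011, "budget": None},   # missing field
    {"title": "C", "year": 2012, "budget": 40.0},
    {"title": "A", "year": 2010, "budget": 100.0},  # duplicate of the first row
    {"title": "D", "year": 2013, "budget": 10.0},
]

cleaned = drop_duplicates(impute_missing(movies, "budget", "median"), ["title", "year"])
```

Because the fill value is computed before duplicates are dropped, duplicated rows count twice in the mean or median; removing duplicates first would be an equally defensible ordering.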
C. Data Integration and Transformation
The data obtained from the three sources (IMDb, Wikipedia, and Rotten Tomatoes) were then integrated into one database. In this
step, the integrated data are transformed or consolidated so that the regression process is more efficient and easier. The dataset
contains both nominal [3] and numeric attributes, but a regression process requires all attributes to be numeric. We converted each
nominal attribute to a numeric one by replacing each of its values with a measure of central tendency of the box office revenue
of the movies that share that value.
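One way to read this conversion is that each nominal value (for example, a director's name) is replaced by the mean or median box office revenue of the movies carrying that value. The sketch below illustrates that reading under stated assumptions: the field names `director` and `gross` and the helper `encode_by_revenue` are hypothetical, and the paper does not say which central tendency was used for which attribute.

```python
from collections import defaultdict
from statistics import mean

def encode_by_revenue(rows, column, target="gross", stat=mean):
    """Replace each nominal value in `column` with `stat` of `target` over rows sharing it."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[column]].append(r[target])
    # Lookup table: nominal value -> central tendency of its revenues.
    table = {value: stat(revenues) for value, revenues in groups.items()}
    encoded = [dict(r, **{column: table[r[column]]}) for r in rows]
    return encoded, table

# Hypothetical records; gross revenue is in millions of dollars.
movies = [
    {"director": "X", "gross": 100.0},
    {"director": "X", "gross": 300.0},
    {"director": "Y", "gross": 50.0},
]

encoded, table = encode_by_revenue(movies, "director")
```

Passing `stat=median` (from the `statistics` module) gives the median variant. When the target is the same revenue that the model later predicts, the lookup table should be built from training rows only, otherwise the encoded feature leaks the answer.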
IV. RESULTS
Linear regression gave a success rate of about 51%. Logistic regression gave about 42%, which is comparatively low, and the SVM
approach had a success rate of 39%. The error tolerance was 20% for SVM and linear regression and 12.5% for logistic regression.
Of the 20 features in the dataset, budget, director, writer, actor1, actor2, gender, and tomato reviews were found to be the most
significant. Results are shown in Table 2.
Model          Linear Regression   Logistic Regression   SVM
Tolerance      20.0%               12.5%                 20.0%
Success Rate   50.7%               42.3%                 39.0%
Correlation    0.965               -                     0.956
Table 2 Results
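The text does not define how the tolerance, success rate, and correlation in Table 2 are computed. One plausible reading, assumed here, is that a prediction counts as a success when its relative error is within the tolerance, and that the correlation is the Pearson coefficient between predicted and actual values. The sketch below follows that assumption; the function names and sample values are hypothetical and are not the study's data.

```python
from math import sqrt

def success_rate(actual, predicted, tolerance):
    """Fraction of predictions whose relative error is at most `tolerance` (assumed metric)."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * abs(a))
    return hits / len(actual)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical gross revenues in millions of dollars.
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [110.0, 150.0, 55.0, 100.0]

rate = success_rate(actual, predicted, 0.20)  # 20% tolerance, as used for linear regression and SVM
corr = pearson(actual, predicted)
```

Under this reading a model can show a high correlation and still a modest success rate, since correlation rewards getting the ordering and scale roughly right while the success rate requires each individual prediction to fall inside the tolerance band.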
REFERENCES
[1] Darin Im, Minh Thao, Dang Nguyen, "Predicting Movie Success in the U.S. Market," Dept. Elect. Eng., Stanford Univ., California, December 2011.
[2] Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd ed. MA: Elsevier, 2011, pp. 83-117.
[3] Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 1973.
[4] J. Cohen, P. Cohen, S. G. West, L. S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 2003.
[5] Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006, p. 205.
[6] Nello Cristianini, John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000. ISBN 0-521-78019-5.
[7] W. Zhang and S. Skiena, "Improving Movie Gross Prediction Through News Analysis," IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Milan, 2009.
[8] Sagar V. Mehta, Rose Marie Philip, Aju Talappillil Scaria, "Predicting Movie Rating Based on Text Reviews," Dept. Elect. Eng., Stanford Univ., California, December 2011.
[9] Suhaas Prasad, "Using Social Networks to Improve Movie Ratings Predictions," Dept. Elect. Eng., Stanford Univ., California, 2010.
[10] Deniz Demir, Olga Kapralova, Hongze Lai, "Predicting IMDb Movie Ratings Using Google Trends," Dept. Elect. Eng., Stanford Univ., California, December 2012.
[11] The Internet Movie Database (IMDb). https://www.imdb.com/