Steam Reviews Aspect-Based Sentiment Analysis
1 Abstract
Because Steam reviews require users to state whether they recommend a game, the overall sentiment of a review is easy to determine. However, this overall sentiment may not be enough to determine which features a game gets right or wrong. Instead, game developers may benefit from capturing players' sentiments towards specific aspects. Aspect-based sentiment analysis (ABSA) can aid developers' work on future updates and games by helping them identify key strengths and areas that need targeted improvement. Furthermore, Steam users may also utilize features created with aspect-based sentiment analysis to be quickly informed of a game's strengths and weaknesses and make more informed purchase decisions. In this paper, we implement two approaches, lexicon-based and machine learning, to perform aspect-based sentiment analysis on Steam reviews and compare them to determine which may work best in the context of game reviews. The comparison covers the quality of each approach's results, its setup requirements, and its speed. Our findings highlight the strengths and limitations of each approach for aspect-based sentiment analysis of game reviews.
2 Introduction
Steam is the largest video game distribution platform globally, offering a vast
collection of games, reviews, and user-generated data. Its popularity generates a
massive repository of insights into user preferences, especially concerning game
features such as gameplay, graphics, and music. Analyzing this data presents
opportunities to understand and enhance both player satisfaction and developer
strategies.
Steam's review system allows users to indicate their overall sentiment about a game through a binary classification: whether they recommend the game or not. While straightforward, this approach often overlooks the nuanced sentiments players express about specific features. Developers seeking to improve their games must identify what resonates with players and what needs attention. For instance, a developer whose game earns numerous reviews praising its art direction might use this feedback to reinforce that strength. Conversely, aspect-based sentiment analysis (ABSA) can reveal critical issues like balancing problems or performance flaws, guiding targeted improvements. This process is particularly valuable for live-service games like Dota or Counter-Strike, where maintaining player satisfaction is crucial for long-term engagement.
Players can also benefit from an analysis of game aspects. A potential
buyer seeking compelling storytelling or strong performance on limited hard-
ware could make more informed decisions with aspect-specific sentiment ratings
or summaries. While forums and reviews provide some of this information, a
structured analysis can streamline the decision-making process by highlighting
critical strengths and weaknesses quickly.
ABSA has been widely applied in various domains, including product re-
views, social media analysis, and customer feedback. Companies like Amazon
and Best Buy already incorporate this technology to enhance customer insights
and decision-making processes. However, Steam has yet to leverage ABSA to en-
hance its review ecosystem. By applying ABSA to Steam reviews, we aim to pro-
vide actionable insights for developers while offering users data-driven analyses
of game features. This dual benefit empowers both groups: developers in refining their products and players in making more informed purchase decisions.
The two primary approaches to ABSA are lexicon-based and machine learning-
based, with hybrid methods also emerging as a promising alternative. Nath and
Dwivedi (2024) [5] outline these techniques, including lexical, KNN, and BERT-
based ABSA, demonstrating their distinct advantages and challenges. In this
paper, we implement and compare these approaches to identify the most ef-
fective method for analyzing game reviews. To achieve this, we use the Steam
Video Game and Bundle Data from previous studies. Kang and McAuley (2018)
[4] introduced a self-attention-based sequential model for next-item recommen-
dations, using a dataset of Steam reviews from 2010 to 2018. Wan and McAuley
(2018) [8] developed a recommendation framework leveraging monotonic depen-
dency structures in user feedback, focusing on Australian users. Pathak, Gupta,
and McAuley (2017) [6] contributed another dataset centered on personalized
bundle recommendations. By building on these datasets, our work aims to
bridge the gap between ABSA methodologies and their application to video
game reviews.
3 Methods
3.1 Data Preprocessing
The dataset was provided in JSON format, with each line representing a user and their associated reviews. We first processed the dataset using Python's ast module to parse each line into a dictionary. Each dictionary was then flattened into rows of reviews for a Pandas DataFrame for further processing. The dataset contains over 93 million reviews, giving us a robust foundation for our analysis.
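To make this step concrete, here is a minimal sketch of the parsing loop. The file name and field names follow the public Steam review dump from [8] and should be treated as assumptions rather than our exact code.

import ast

import pandas as pd

rows = []
with open("australian_user_reviews.json", encoding="utf-8") as f:
    for line in f:
        user = ast.literal_eval(line)  # each line is a Python-literal dict
        for review in user.get("reviews", []):  # flatten: one row per review
            rows.append({
                "user_id": user.get("user_id"),
                "item_id": review.get("item_id"),
                "recommend": review.get("recommend"),
                "text": review.get("review"),
            })

df = pd.DataFrame(rows)
print(df.shape)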
3.2 Lexicon-based Sentiment Analysis
3.2.1 Data Cleaning and Aspect Construction
Cleaning the data for this task involved several essential preprocessing steps to ensure high-quality input for analysis. First, we identified and removed stop words from the reviews using the spaCy library [1]. Stop words, such as "the," "is," and "and," often add little meaning to the analysis and can be safely excluded to focus on more relevant terms. Next, contractions in the text (e.g., "can't," "won't") were expanded into their full forms ("cannot," "will not") using a predefined contraction dictionary. This dictionary was specifically sourced from Rob Taylor's repository, designed for analyzing Steam reviews [7]. By expanding contractions, we standardized the text for consistent tokenization. Finally, lemmatization was applied with spaCy to reduce words to their base or dictionary forms. For instance, words like "running," "ran," and "runs" were converted to "run," ensuring that different variants of a word were treated as a single entity during analysis.
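The cleaning pipeline can be sketched as follows; the contraction entries shown are an illustrative subset of the dictionary from [7], not the full list.

import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative subset of the contraction dictionary sourced from [7].
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_review(text):
    # Expand contractions before tokenization.
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    doc = nlp(text.lower())
    # Drop stop words and punctuation; keep lemmas of the remaining tokens.
    return [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]

print(clean_review("I can't stop running this game, it's great!"))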
3.2.3 Sentiment Analysis and Aspect Scoring
Once the aspect term lists were finalized, we analyzed the sentiment of each sentence in the reviews. Using NLTK's lexicon-based sentiment analyzer, VADER [2], we calculated the sentiment polarity (positive, negative, or neutral)
of sentences containing terms associated with specific aspects. If a sentence
included a term from an aspect’s dictionary, the sentiment score of that sentence
was assigned to the corresponding aspect. For example, if a sentence mentioned
”gameplay” in a positive context, that positive score was attributed to the
gameplay aspect.
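A sketch of this scoring loop is shown below. It assumes NLTK's vader_lexicon and punkt resources have been downloaded, and the aspect dictionaries are illustrative stand-ins for our full term lists.

from collections import defaultdict

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

# Illustrative stand-ins for the full aspect term lists.
ASPECT_TERMS = {
    "gameplay": {"gameplay", "mechanic", "control"},
    "performance": {"fps", "lag", "crash"},
}

analyzer = SentimentIntensityAnalyzer()

def score_aspects(review):
    scores = defaultdict(list)
    for sentence in sent_tokenize(review):
        compound = analyzer.polarity_scores(sentence)["compound"]
        words = set(sentence.lower().split())
        # Attribute the sentence's polarity to every aspect it mentions.
        for aspect, terms in ASPECT_TERMS.items():
            if words & terms:
                scores[aspect].append(compound)
    return scores

print(score_aspects("The gameplay is fantastic. Constant fps drops ruin it though."))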
By combining manual selection, embedding-based expansion, and lexicon-
based sentiment analysis, we were able to create a robust framework for ana-
lyzing aspect-based sentiment in game reviews. This process ensured that our
analysis was both comprehensive and adaptable to the nuances of player feed-
back.
For our unsupervised approach, we first generated embeddings of the reviews' sentences using the sentence-transformers library, which uses the SBERT model to create vector representations of sentences. The embeddings were then clustered with the K-Means algorithm, with the goal of identifying the key topics of these sentences related to different aspects of a game. The sentences closest to the clusters' centroids were selected as representatives of the aspects and interpreted for their sentiment. This step is crucial, as having the LLM analyze every review would require considerable computation; using K-Means to find representative sentences helps scale this approach to a very large amount of review data. Finally, we set up a pipeline that uses a large language model (LLM) for aspect-based sentiment analysis and summarization. We chose to pass the resulting sentences into Llama 3.2 1B Instruct to leverage its text generation, summarization, and sentiment extraction capabilities, inspired by Amazon's relatively new AI feature, "Customers say".
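The clustering step might look like the following sketch; the all-MiniLM-L6-v2 checkpoint and the toy sentences are assumptions for illustration, and in practice we run this over many review sentences with k chosen via the Elbow Method.

import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The base building is really fun.",
    "Frame drops make planets unplayable.",
    "Exploration feels endless and relaxing.",
    "Crashes every hour on my machine.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint
embeddings = model.encode(sentences)

k = 2  # toy value; we pick k with the Elbow Method on real data
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings)

# The sentence nearest each centroid represents that cluster's topic.
representatives = []
for centroid in kmeans.cluster_centers_:
    idx = int(np.argmin(np.linalg.norm(embeddings - centroid, axis=1)))
    representatives.append(sentences[idx])
print(representatives)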
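The LLM stage can then be sketched as below, continuing from the representatives above. The prompt wording is a simplified assumption, the Llama 3.2 checkpoint is gated on Hugging Face, and a recent transformers version with chat-template support is assumed.

from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

# representatives holds the centroid-nearest sentences from the previous sketch.
prompt = (
    "For each game aspect mentioned in the review sentences below, state the "
    "players' sentiment and briefly summarize their reasons.\n"
    + "\n".join(f"- {s}" for s in representatives)
)

out = generator([{"role": "user", "content": prompt}], max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply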
4 Results
With our first two methods, we performed aspect-based sentiment analysis on
three games’ reviews: Terraria, Sonic the Hedgehog 4 - Episode I, and No
Man's Sky. These games were chosen based on their overall user sentiment at the time of data collection: positive, negative, and mixed, respectively. This
allowed us to compare how these methods function for games with different
overall sentiments and subjectively gauge the performance of each method.
Figure 1: Aspect Sentiment Scores for Terraria

No Man's Sky's scores fell between those of the two previous games. Figure 6 also shows that the sentiment towards different features of No Man's Sky was mostly neutral.
Figure 3: Aspect Sentiment Scores for Sonic

When we initially tested our model trained with PyABSA [9] on sample reviews of No Man's
Sky, we noticed a difference between using a pre-trained checkpoint directly and
fine-tuning the checkpoint. On one hand, using a pre-trained checkpoint allowed
the model to identify relevant aspects and their sentiments with impressive ac-
curacy. It correctly recognized positive experiences and negative factors, though
it sometimes overlooked certain specific aspects. On the other hand, fine-tuning
the model with our dataset led to a much less balanced performance. Due to
our aforementioned problems with the dataset, the fine-tuned model ended up
detecting mostly positive aspects and struggled to pick up on many neutral or
negative sentiments. The following is one of the best results we obtained from the fine-tuned model:
Figure 5: Aspect Sentiment Scores for No Man’s Sky
I ’ m having a great <time:Positive Confidence:0.9897> with this one . Sure , it suffered fr
Figure 8: Aspect-Based Sentiment Distribution of Sonic
On the flip side, Sonic shows widespread dissatisfaction, especially in gameplay, audio, and
performance, signaling the need for substantial improvements. Meanwhile, No
Man’s Sky largely delights players with its community and gameplay, but stum-
bles when it comes to pricing and performance issues. In the end, these findings
emphasize the importance of balancing various aspects of a gaming experience
and addressing any shortcomings that may hinder player satisfaction.
However, since we gave these unconnected sentences to the LLM, it may not have had the
full context for some sentences, resulting in inaccurate responses. For example,
in one experiment, it gave us the analysis: "Players appreciate the game's accessibility, noting that they can complete the main story in under 200 hours."
This is not true, as Terraria does not have a main story, and completing a game
in 200 hours is not typically considered accessible for the average player. Never-
theless, the model gave us coherent responses that could be insightful regarding
the features of the games and even output the correct overall sentiment of each
game. Here is a full result from running a different experiment on No Man's Sky after tuning our prompt for the LLM:
The game’s base-building aspect is seen as fun and engaging, with some players appreciating
While predicting the overall sentiment is not our task and is not a direct measure of the approach's performance, it gives us a form of sanity check: whether the model outputs results consistent with the known overall sentiment.
5 Discussion
Overall, we see a major speed difference between the approaches. To demonstrate, we selected a top review from Farming Simulator and timed each method on it. Analyzing the review took 0.05 seconds with the lexicon-based sentiment analysis tool and 1.3 seconds with the supervised model, while the Llama 3.2 model took 7.7 seconds. The speed of the LLM is one of the reasons we use K-Means to trim down the amount of data it has to process.
Using the Vader sentiment analysis tool was the fastest and allowed us to
extract metrics regarding sentiments towards different aspects of a game using
the entire dataset. For example, as shown in Figure 5, audio received the highest
sentiment score, indicating that No Man’s Sky reviewers are especially happy
with the game’s sound design. Storytelling and visuals also performed well,
suggesting that players appreciate these elements of the game. Interestingly, the
sentiment towards performance was the lowest for all three games. Reviewers,
when they mention the performance aspect of the game, seem to view it in a
negative light. This can be explained by players generally not writing reviews
to praise the performance of a game, but criticizing it when they encounter
performance issues. Additionally, this method requires some knowledge of games and their design aspects to construct an effective aspect term list.
The unsupervised machine learning method with K-Means was easier to
set up, introduced the least human bias, and required minimal prior knowledge
regarding the data. It also provided the most natural yet still insightful analysis
with the help of an LLM. However, it was difficult to determine which aspects
were left out by selecting representative sentences.
Our specialized ABSA model approach required considerable time and effort to label the data, whether training the model from scratch or adapting it to a specific domain like game reviews. However, the resulting model is potentially better than both the lexicon tool and a pre-trained LLM for aspect extraction and sentiment analysis: the supervised model's results aligned much more closely with our expectations from the overall sentiment of the three games than the lexicon-based approach's results did. The model is also faster than an LLM at analyzing a single review, meaning it can process a huge number of reviews in a reasonable amount of time.
Our lexicon-based approach and supervised model approach give us results in the form of metrics, while our K-Means clustering approach combined with an LLM gives us summarized, coherent text about the reviews. These two forms of result are fundamentally different, and it is hard to compare them directly. However, we think the two approaches can be used in conjunction with each other, as Best Buy evidently does: the metric rating of each aspect can be a quick way for users to identify the strengths and weaknesses of a game, while the summarized text provides nuanced analysis for interested users.
5.1 Limitations
Our analysis has some limitations that need to be acknowledged. First, the
sentiment scoring might not fully account for the complexity of natural lan-
guage. Reviews that include sarcasm, ambiguous expressions, or mixed senti-
ments could lead to misclassification, affecting the accuracy of the sentiment
analysis. This is especially relevant for highly nuanced or community-specific
language often found in gaming reviews.
Second, the lexicon-based method is limited by its reliance on predefined
aspect terms and the subsequent aspect assignments. While it captures general
trends, it may miss emerging or game-specific aspects not included in the initial
list, such as new features introduced through updates. Furthermore, words can
be assigned to different aspects depending on the context. For example, ”fps”
could mean ”first-person shooter,” related to gameplay, or ”frames per second,”
related to performance. Our method operates on a sentence-by-sentence basis,
which means it does not perform well when different aspects exist within a single
sentence.
The clustering approach, while useful for grouping reviews, depends heavily
on the selection of the optimal number of clusters. The Elbow Method provides
a heuristic, but K-Means may not always capture the true underlying structure
of the data. Furthermore, running an LLM can be computationally expensive,
and text generated by it should always be questioned for correctness.
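As a concrete illustration of the heuristic, an elbow sweep over the SBERT embeddings might look like the following sketch, where embeddings is assumed to come from the clustering step in Section 3.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# embeddings: the SBERT sentence vectors from the clustering step.
ks = range(2, 15)
inertias = [
    KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings).inertia_
    for k in ks
]

plt.plot(list(ks), inertias, marker="o")  # the "elbow" is where gains flatten
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()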
The small labeled dataset for the supervised machine learning method limits
the model’s ability to generalize to unseen data. Additionally, as we were the
only people manually labeling the dataset, the training data may contain errors
and bias.
5.2 Future Work
One avenue for future work is combining lexicon-based techniques with machine learning models, resulting in improved aspect extraction and sentiment classification.
Lastly, future work could include more granular aspect analysis. For ex-
ample, breaking down ”performance” into subcategories like ”loading times,”
”frame rates,” and ”stability” could provide better insights. Incorporating user-
generated tags and metadata from reviews could potentially help identify emerg-
ing trends or concerns related to newly released content or updates.
References
[1] Matthew Honnibal and Ines Montani. “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing”. To appear. 2017.
[2] C. Hutto and Eric Gilbert. “VADER: A parsimonious rule-based model for sentiment analysis of social media text”. In: Proceedings of the International AAAI Conference on Web and Social Media 8.1 (May 2014), pp. 216–225. doi: 10.1609/icwsm.v8i1.14550.
[3] Intellica.AI. Aspect-based sentiment analysis: everything you wanted to know! Feb. 2020. url: https://intellica-ai.medium.com/aspect-based-sentiment-analysis-everything-you-wanted-to-know-1be41572e238.
[4] Wang-Cheng Kang and Julian McAuley. Self-Attentive Sequential Recommendation. 2018. arXiv: 1808.09781 [cs.IR]. url: https://arxiv.org/abs/1808.09781.
[5] Deena Nath and Sanjay K. Dwivedi. “Aspect-based sentiment analysis: approaches, applications, challenges and trends”. In: Knowl. Inf. Syst. 66.12 (Aug. 2024), pp. 7261–7303. issn: 0219-1377. doi: 10.1007/s10115-024-02200-9.
[6] Apurva Pathak, Kshitiz Gupta, and Julian McAuley. “Generating and Personalizing Bundle Recommendations on Steam”. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’17. Shinjuku, Tokyo, Japan: Association for Computing Machinery, 2017, pp. 1073–1076. isbn: 9781450350228. doi: 10.1145/3077136.3080724.
[7] Rob Taylor. SA Steam Reviews. https://github.com/r0btaylor/SA_steam_reviews. 2022.
[8] Mengting Wan and Julian McAuley. “Item recommendation on monotonic behavior chains”. In: Proceedings of the 12th ACM Conference on Recommender Systems. RecSys ’18. Vancouver, British Columbia, Canada: Association for Computing Machinery, 2018, pp. 86–94. isbn: 9781450359016. doi: 10.1145/3240323.3240369.
[9] Heng Yang, Chen Zhang, and Ke Li. “PyABSA: A Modularized Framework for Reproducible Aspect-based Sentiment Analysis”. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. CIKM ’23. Birmingham, United Kingdom: Association for Computing Machinery, 2023, pp. 5117–5122. isbn: 9798400701245. doi: 10.1145/3583780.3614752.
[10] Heng Yang et al. “A Multi-task Learning Model for Chinese-oriented Aspect Polarity Classification and Aspect Term Extraction”. In: CoRR abs/1912.07976 (2019). arXiv: 1912.07976. url: http://arxiv.org/abs/1912.07976.
[11] Biqing Zeng et al. “LCF: A Local Context Focus Mechanism for Aspect-Based Sentiment Classification”. In: Applied Sciences 9.16 (2019). issn: 2076-3417. doi: 10.3390/app9163389. url: https://www.mdpi.com/2076-3417/9/16/3389.