SNA Unit-5
1. Informality and Conversational Tone: Social media text often mirrors spoken
language. You'll find:
● Contractions and Informal Reductions: Words like "gonna," "wanna," and "ain't" are common.
● Slang and Colloquialisms: Everyday, informal language and regional
expressions pop up frequently.
● Shorter Sentences: Brevity is key for quick consumption.
● Direct Address: Users often speak directly to their audience or other users.
5. Personalization and Identity: Users often craft their online persona through their
text:
● Personal Opinions and Experiences: Sharing individual thoughts and stories is
common.
● Brand Voice: Organizations develop a specific tone and style for their
communications.
7. Noisy and Unstructured Data: This informality and rapid creation can lead to:
● Grammatical Errors and Typos: Less emphasis on formal writing rules.
● Code-Mixing: Especially in multilingual societies, mixing of languages within a
single post is common.
● Ambiguity and Context Dependence: Meaning can be heavily reliant on the
surrounding conversation or shared understanding.
1. Misinformation and Disinformation: The speed and reach of social media make it
easy for false or misleading information to spread rapidly. This can have serious
consequences, impacting public opinion, health, and even safety.
2. Lack of Nuance and Context: The brevity of social media text can strip away
important context and nonverbal cues. This can lead to misunderstandings,
misinterpretations, and even conflict. Sarcasm, for example, is often lost in plain text.
3. Cyberbullying and Harassment: The anonymity and public nature of many social
media platforms can create an environment where cyberbullying and harassment thrive.
This can have severe emotional and psychological impacts on individuals.
4. Privacy Concerns: Sharing personal information on social media, even in seemingly
innocuous posts, can lead to privacy breaches and data vulnerabilities. Information can
be used in ways users didn't intend or anticipate.
6. Addiction and Time Management: The design of social media platforms often
encourages compulsive use, leading to addiction and significant time wasted. This can
detract from productivity, real-life relationships, and overall well-being.
7. Filter Bubbles and Echo Chambers: Algorithms on social media often prioritize
content that aligns with a user's existing beliefs and interests. This can create filter
bubbles or echo chambers, limiting exposure to diverse perspectives and potentially
reinforcing biases.
9. Data Overload and Noise: The sheer volume of social media text can be
overwhelming, making it difficult to find relevant and reliable information. Spam,
irrelevant content, and off-topic chatter contribute to this "noise."
10. Challenges for Natural Language Processing (NLP): The informal language,
abbreviations, slang, typos, and code-mixing common in social media text pose
significant challenges for NLP tasks such as sentiment analysis and topic modeling.
2. Text Cleaning: This is where you tackle the nitty-gritty of making the text more
uniform and analyzable (a small code sketch follows this list).
● Lowercasing: Converting all text to lowercase (e.g., "Hello" becomes "hello")
helps ensure that words are treated the same regardless of capitalization.
● Removing Punctuation: Punctuation marks (like commas, periods, question
marks) usually don't add much to the meaning of the words themselves in most
analyses.
● Handling Special Characters: Removing or replacing special characters (like
emojis, symbols) depending on your needs. Emojis, for instance, can carry
sentiment, so you might want to handle them differently (e.g., converting them to
text descriptions).
● Dealing with URLs and Mentions: You might want to remove URLs or replace
them with a placeholder (like "[URL]"). Similarly, you could remove or mark user
mentions (@usernames).
● Handling Hashtags: Decide whether to keep hashtags, remove the "#" symbol
and treat them as regular words, or extract them for separate analysis. Hashtags
often carry valuable information about the topic or sentiment.
● Addressing Numbers: Depending on your analysis, you might remove numbers
or convert them to a standard format.
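As a concrete illustration, here is a minimal Python sketch of the cleaning steps above. The regexes, the placeholder tokens ("URL", "USER"), and the decision to keep hashtag words are illustrative assumptions; a real pipeline tunes each choice to the analysis at hand.

import re

def clean_post(text):
    # Each step mirrors one cleaning choice from the list above.
    text = text.lower()                             # lowercasing
    text = re.sub(r"https?://\S+", " URL ", text)   # replace links with a placeholder
    text = re.sub(r"@\w+", " USER ", text)          # mark user mentions
    text = re.sub(r"#(\w+)", r" \1 ", text)         # keep the hashtag word, drop "#"
    text = re.sub(r"\d+", " ", text)                # drop numbers (analysis-dependent)
    text = re.sub(r"[^\w\s]", " ", text)            # strip punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()        # collapse leftover whitespace

print(clean_post("OMG @bob this is GREAT!! #happy http://t.co/x 100%"))
# -> "omg USER this is great happy URL"

Because lowercasing runs first, the uppercase placeholders stay distinguishable from ordinary words in the cleaned output.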
4. Feature Engineering (Optional but Often Crucial): After the basic cleaning and
normalization, you might want to create new features from the text that can be useful for
your analysis (see the n-gram sketch after this list).
● N-grams: Considering sequences of words (e.g., "very good," "not happy"). This
can capture more contextual meaning than individual words.
● Sentiment Scores: Using sentiment analysis tools to assign a sentiment score
(positive, negative, neutral) to each piece of text.
● Topic Modeling: Identifying the main topics discussed in the text.
● Word Embeddings (Word2Vec, GloVe, FastText): Representing words as
dense vectors in a high-dimensional space, capturing semantic relationships
between words. These are often used as input for more advanced machine
learning models.
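To make the n-gram idea concrete, here is a small scikit-learn sketch; the two-post toy corpus and the ngram_range choice are assumptions for illustration.

from sklearn.feature_extraction.text import CountVectorizer

posts = ["the movie was very good", "not happy with the service"]  # toy corpus

# Unigrams plus bigrams, so sequences like "very good" and "not happy"
# become features alongside the individual words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(posts)

print(vectorizer.get_feature_names_out())  # includes "very good" and "not happy"
print(features.toarray())                  # one row of counts per post

The same count matrix can feed a downstream classifier, or be replaced by sentiment scores or word embeddings as described above.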
There are primarily two main categories of approaches, often used in combination:
1. Rule-Based Approaches:
● Keyword Lists: This involves creating lists of offensive words, phrases, and
slurs. When these keywords are detected in a text, it can be flagged as
potentially aggressive or abusive.
● Regular Expressions: More sophisticated pattern matching can be used to
identify variations of offensive terms (e.g., using wildcards or character
substitutions), as in the sketch after this list.
● Limitations: Rule-based systems are often brittle, meaning they can be easily
bypassed by slight variations in language or the use of novel offensive terms.
They also struggle with context and may flag non-offensive language if it
happens to contain a keyword.
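Here is a minimal sketch of the keyword-plus-regex idea, using mild placeholder words rather than a real lexicon; the substitution map is an assumption meant to catch simple obfuscations like "1d1ot".

import re

KEYWORDS = ["idiot", "stupid"]  # toy keyword list (placeholder, not a real lexicon)

# Common character substitutions used to evade keyword filters.
SUBS = {"a": "[a@4]", "e": "[e3]", "i": "[i1!]", "o": "[o0]", "s": "[s$5]"}

def keyword_pattern(word):
    # Expand each letter into a character class covering its look-alikes.
    return "".join(SUBS.get(ch, re.escape(ch)) for ch in word)

PATTERNS = [re.compile(rf"\b{keyword_pattern(w)}\b", re.IGNORECASE) for w in KEYWORDS]

def flag(text):
    # True if any keyword, or an obfuscated variant of it, appears.
    return any(p.search(text) for p in PATTERNS)

print(flag("what an 1d1ot"))    # True - the substitution is caught
print(flag("have a nice day"))  # False

Even with substitution handling, such rules stay brittle, which is exactly the limitation noted above.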
2. Machine Learning Approaches:
These are the dominant approaches due to their ability to learn complex patterns and
handle nuanced language.
● Deep Learning: Neural models (e.g., CNNs, LSTMs, and transformer models
such as BERT) learn patterns directly from labelled examples instead of
hand-written rules; a short sketch follows.
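As a minimal deep-learning sketch, the snippet below uses the Hugging Face transformers pipeline; the "unitary/toxic-bert" checkpoint is an assumption (one publicly available toxicity classifier), and any model fine-tuned for abuse detection could be substituted.

from transformers import pipeline

# Load a pretrained toxicity classifier (model name is an assumption).
classifier = pipeline("text-classification", model="unitary/toxic-bert")

for post in ["have a great day", "you are worthless"]:
    result = classifier(post)[0]  # e.g. {"label": "toxic", "score": 0.98}
    print(post, "->", result["label"], round(result["score"], 3))

The model assigns each post a label and a confidence score, which a moderation system can threshold.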
Approaches to Hate Speech Detection:
● Hate Keyword Lists: These lists contain slurs, derogatory terms, and phrases
specifically targeting protected groups.
● Pattern Matching: Identifying patterns that indicate hateful intent, such as
combinations of negative sentiment words with group identifiers (sketched after
this list).
● Limitations: Suffers from the same issues as general rule-based systems
(brittleness, context insensitivity, evasion).
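A toy version of the sentiment-word-plus-group-identifier pattern; both word lists here are short placeholders (an assumption), whereas real systems use curated lexicons.

import re

NEGATIVE = {"hate", "disgusting", "inferior"}     # toy negative-sentiment words
GROUP_TERMS = {"immigrants", "women", "muslims"}  # toy group identifiers

def hateful_pattern(text):
    tokens = set(re.findall(r"\w+", text.lower()))
    # Flag only when a negative word co-occurs with a group identifier.
    return bool(tokens & NEGATIVE) and bool(tokens & GROUP_TERMS)

print(hateful_pattern("I hate immigrants"))  # True - both lists are hit
print(hateful_pattern("I hate Mondays"))     # False - no group identifier

The co-occurrence test adds a little context beyond bare keywords, but it still inherits the brittleness described above.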
1. Targeted Groups: This is perhaps the most common way to categorize hate speech,
focusing on the specific groups being attacked (e.g., groups defined by race, religion,
ethnicity, gender, or sexual orientation).
2. Severity: Hate speech also varies in intensity:
● Mild Insults: While offensive, they may not necessarily incite hatred or violence.
● Severe Vilification: Language that strongly denigrates and dehumanizes.
● Incitement: Explicitly or implicitly encouraging violence or discrimination.
4. Public vs. Private: While the focus is often on public expressions, hate speech can
occur in private settings as well. However, legal and social responses often differ.
Publicly expressed hate speech is generally considered more serious due to its
potential to influence a wider audience.
5. Online vs. Offline: The internet and social media have provided new avenues for the
rapid dissemination of hate speech, often with a sense of anonymity.
Cyberbullying Detection
Cyberbullying detection systems use machine learning to identify and flag potentially
abusive content on social media or online platforms. These systems analyze text,
images and sometimes video data to detect patterns and language that suggest
cyberbullying behaviour.
Hybrid Approaches:
Combining rule-based methods with traditional machine learning can be effective
(a sketch follows this list). For example:
● Use rule-based systems to flag potentially problematic content based on
keywords or patterns.
● Then, use a machine learning classifier with more sophisticated features to
further analyze the flagged content and make a final determination.
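A minimal two-stage sketch of this hybrid idea: the trigger words, the four training posts, and their labels are all toy assumptions, and a real system would need a properly labelled corpus.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stage 1: cheap rule-based filter (toy trigger list - an assumption).
TRIGGER = re.compile(r"\b(expose|leak|revenge)\b", re.IGNORECASE)

# Stage 2: a classical classifier trained on labelled examples (toy data).
train_texts = ["going to expose her photos", "revenge will be sweet",
               "water leak in my kitchen", "lovely day for a walk"]
train_labels = [1, 1, 0, 0]  # 1 = abusive "revenge" content, 0 = benign

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

def moderate(text):
    if not TRIGGER.search(text):  # most content passes the cheap rule stage
        return "ok"
    prob = clf.predict_proba([text])[0][1]  # ML stage scores the flagged text
    return "review" if prob > 0.5 else "ok"

print(moderate("what a lovely day"))               # "ok" - no trigger, model never runs
print(moderate("I will expose her private pics"))  # outcome depends on the trained model

The rule stage keeps the expensive classifier off most traffic, while the classifier supplies the contextual judgment the rules lack.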
Overall, detecting revenge posts without deep learning relies on carefully crafted rules
and the extraction of relevant features from the text and metadata to train traditional
machine learning classifiers. While these methods have limitations in handling nuanced
language and context compared to deep learning, they can still be effective when
well-designed and continuously updated.