Web Science

Revision

Topic 1- Social Media (Twitter) Crawl

L1 TwitterData

Exploitation of Twitter data:

With the amount of content posted to social media websites every day, such as Twitter, there is huge potential for its exploitation in many scenarios, such as:

Sports and Finance: Stock market prediction & Sports betting

predict stock market changes based on the sentiment of tweets

  • Sentiment expressed in tweets has been used to predict stock market reactions
  • Opinions in investors’ posts on social media
  • Combined with stock price movements

Sports betting

Web Science Concepts

  • Explore the science underlying the web

    • From a socio-technical perspective
    • Mathematical properties, engineering principles, social impacts
  • Understanding users and developing Web applications for them!

    • Sociology & Web Engineering
  • Consulting corporations about social media activities

    • Economics & Web analytics
    • For example, role of micro influencers on local economy
    • How much hate a brand page generates due to some comments, ….
  • Data analytics

    • Growth of information; structured and unstructured
    • Intersection of networks & data
  • New technologies require scientists and engineers working together on a large scale

L2 DataClustering

Content processing

1. Removing stuff

  • Non-ASCII removal

e.g., removing emoji …


2. Grouping tweets

  • Based on content analysis like
    • Clustering, locality sensitive hashing
    • Or through content indexes
  • Once we know the groups
    • We could analyse the words, user mentions, hashtags in these groups
    • We can add these terms to a list with a priority
    • This can then be used to identify more tweets of this type
      • Aim is data gathering
  • We can also look at prolific tweeters
    • What are their total tweet counts?

3. Tokenization

Separate each token

Remove stopwords

4. Vector Representation

Documents are represented by a term vector

D_i = (t_i1, t_i2, …, t_in)

Queries are represented by a similar vector

• In a binary scheme, t_ik is set to 1 when term k is present in document i, and 0 otherwise

  • The most relevant documents for a query are expected to be those represented by the vectors closest to the query, that is documents that use similar words to the query.
  • Closeness is often calculated by just looking at angles between document vector and query vector
  • We need a similarity measure!
    • Cosine similarity measure
    • Jaccard coefficient
    • Dice coefficient
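
As a concrete illustration of the closeness idea, here is a minimal sketch (not from the lecture) that computes cosine similarity between two token lists using term-frequency vectors; the example tweets are made up.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# hypothetical pre-processed tweets
print(cosine_similarity("cheap flights glasgow".split(),
                        "glasgow flights delayed today".split()))
```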

Finding similar tweets: Single-pass clustering

Single-pass clustering

  • requires a single, sequential pass over the set of documents it attempts to cluster.

  • The algorithm classifies the next document in the sequence according to a condition on the similarity function employed.

  • At every stage, the algorithm decides on whether a newly seen document should become a member of an already defined cluster or the centre of a new one.

    • In its most simple form, the similarity function gets defined on the basis of just some similarity (or alternatively, dissimilarity) measure between document-feature vectors.

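
A minimal sketch of the single-pass idea described above; the `vectorise` and `similarity` functions (e.g., the cosine example earlier) and the threshold value are assumptions left to the caller.

```python
def single_pass_cluster(docs, vectorise, similarity, threshold):
    """Assign each incoming document to its most similar cluster centre,
    or start a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"centroid": vector, "members": [doc, ...]}
    for doc in docs:
        vec = vectorise(doc)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = similarity(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(doc)   # optionally update the centroid here
        else:
            clusters.append({"centroid": vec, "members": [doc]})  # new cluster centre
    return clusters
```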

Comments on Single Pass Method

  • The single pass method is particularly simple
    • since it requires that the data set be processed only once.
  • Obviously, the results for this method are highly dependent on the similarity threshold that is used.
  • It tends to produce large clusters early in the clustering pass,
    • and the clusters formed are not independent of the order in which the data set is processed.
    • You should use your judgment in setting this threshold so that you are left with a reasonable number of clusters.
  • It is sometimes used to form the groups that are used to initiate reallocation clustering.
    • If we get large, noisy clusters of tweets, we could re-cluster them!

L3 Credibility & Newsworthiness

Newsworthy

Key characteristics of newsworthy score

  • Real-time
    • Tweets should be scored as soon as they arrive!
  • Generalizability
    • Should be able to handle any type of event - not just those seen before
  • Adaptive
    • New information to be incorporated, as and when they arrive
    • Incorporate new information to the scoring model

These characteristics should be realized with the help of a classification approach and distant supervision.

Heuristic Labelling

  • Semi-automatic labelling approach
    • Using a set of heuristics to label
      • High quality (newsworthy) and low quality (noisy) content
  • This will not label the majority of the content
  • Advantages
    • Minimal effort in creating a data set
    • Real-life data set - incremental and generalizable
    • Easily built as part of an algorithm, for example event detection

Overall approach

  • Collect high-quality and low-quality sets of data

  • Use this dataset to score the newsworthiness of incoming tweets

Quality Score

Quality Score = (profileWeight + verifiedWeight + followersWeight + accountAgeWeight + descriptionWeight) / 5

Range is [0, 1]

If the quality score is higher than 0.65 -> high quality

If the quality score is lower than 0.45 -> low quality
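
A small sketch of this quality-scoring step, assuming each of the five weights has already been mapped into [0, 1]; only the 0.65 / 0.45 thresholds come from the notes.

```python
def quality_label(profile_w, verified_w, followers_w, account_age_w, description_w):
    """Average the five [0, 1] weights and label the tweet's author quality."""
    score = (profile_w + verified_w + followers_w + account_age_w + description_w) / 5
    if score > 0.65:
        return score, "high-quality"
    if score < 0.45:
        return score, "low-quality"
    return score, "unlabelled"   # content between the thresholds stays unlabelled

print(quality_label(0.9, 1.0, 0.7, 0.8, 0.6))   # -> (0.8, 'high-quality')
```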

Scoring model

Likelihood ratio for each term

  • R(t) = relative importance of a term in the particular quality model compared to a random background model
  • >1
    • Term is more common in the model than random
  • <1
    • Term is less common in the model than random



If R_HQ(t) < 2 or R_LQ(t) < 2 then the score will be set to 0, so as to remove the terms which have no clear association with either high-quality or low-quality content.

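
A hedged sketch of the term-scoring idea: it assumes R(t) is the ratio of a term's maximum-likelihood probability in the quality model to its (add-one smoothed) probability in the background model, and applies the R(t) < 2 filter from the notes.

```python
from collections import Counter

def term_ratios(quality_tokens, background_tokens):
    """R(t) = P(t | quality model) / P(t | background model); terms whose
    ratio is below 2 are scored 0 (no clear association with the model)."""
    q_counts, bg_counts = Counter(quality_tokens), Counter(background_tokens)
    q_total, bg_total = sum(q_counts.values()), sum(bg_counts.values())
    vocab = len(set(quality_tokens) | set(background_tokens))
    ratios = {}
    for term, count in q_counts.items():
        p_quality = count / q_total
        p_background = (bg_counts.get(term, 0) + 1) / (bg_total + vocab)  # add-one smoothing
        r = p_quality / p_background
        ratios[term] = r if r >= 2 else 0.0
    return ratios
```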

L4 Geo-localisation

Fine-grained localization

Fine-grained localization refers to the task of accurately localizing objects or entities within an image or a video with high precision, usually at a sub-pixel or sub-object level. This involves identifying the precise location of the object, as well as any associated attributes such as shape, texture, and color.

The goal of fine-grained localization is to provide more detailed and accurate information about the location and properties of objects in an image, which can be useful for a range of applications such as object tracking, object recognition, and scene understanding.

Fine-grained localization can be challenging due to the variability in object appearance, pose, and scale. To overcome these challenges, various techniques such as deep learning and computer vision algorithms have been developed.

Examples of fine-grained localization tasks include localizing individual bird species in a bird-watching image, identifying specific car models in a crowded parking lot, or detecting the presence of a particular species of fish in an underwater video.

Problem statement

  • Geo-localization
    • Provide location estimates for individual tweets
  • coarse-grained Geo-localization
    • Provide location estimates for individual tweets at regional or country level
  • fine-grained Geo-localization
    • Provide location estimates for individual tweets at city or neighbourhood level
  • Approach
    • Train a model on a geo-tagged data set
      • Validate and test on geo-tagged data
      • Test on non-geo tagged data as well

Topic 3 - Topic Modelling

Topic modelling

Discuss why searching is limited when exploring a collection

When exploring a collection, searching can be limited by various factors such as the completeness and accuracy of the metadata, the complexity of the query, and the quality of the search algorithm.

Firstly, the completeness and accuracy of the metadata associated with each item in the collection can limit the effectiveness of searching. If the metadata is incomplete or inconsistent, important information about an item may not be captured, making it difficult or impossible to find through search queries.

Secondly, the complexity of the search query can also limit the effectiveness of searching. For example, if a user is looking for items that have multiple attributes or characteristics, such as a specific color and shape, the search query may become too complex and difficult to execute accurately.

Finally, the quality of the search algorithm used to explore the collection can also limit the effectiveness of searching. If the algorithm is not designed to handle the specific characteristics of the collection or the query, it may return irrelevant or incomplete results.

To overcome these limitations, various techniques can be employed such as using natural language processing to simplify complex search queries, improving the quality and completeness of metadata through manual curation or machine learning techniques, and using advanced search algorithms that take into account the specific characteristics of the collection and query.

Topic modelling is the process of using a topic model to discover the hidden topics that are represented by a large collection of documents

  • Observed

    • collection

    • Document & words

  • Aim

    • Use the observed information to infer
      • Hidden structure
  • Topic structure - hidden

    • per document topic distributions
    • Per document per-word topic assignments
      • Annotation…
      • Can be used for retrieval, classification, browsing?
  • Utility

    • Inferred hidden structure resembles the thematic structure of the collection

latent

(of a quality or state) existing but not yet developed or manifest; hidden or concealed.

Topic modelling

  • A machine learning approach for mining latent topics

Identify hidden thematic structures

Probabilistic topic models

  • a suite of algorithms that aim to discover and annotate large archives of documents with thematic information

Our goal in topic modelling

  • The goal of topic modeling is to automatically discover the topics in a collection of documents
  • Documents are observed
    • Topics, per-document and per-word topic assignments - hidden
    • Hence latent!
  • The central computational problem for topic modelling is to use the observed documents to infer the hidden topic structure
  • Think it as reversing the generative process
    • What is the hidden structure that likely generated the observed collection?

LDA (Latent Dirichlet Allocation)

  • Is a statistical model of document collections that tries to capture the intuition
    Each document can be described by a distribution of topics and each topic can be described by a distribution of words

  • Topic

    • Defined as a distribution over the words of a fixed vocabulary

      • E.g., genetic topic has words about genetics (sequenced, genes) with high probability

      • Evolutionary biology has words like life, organism with high probability

Topic Modelling Approaches

  • Number of possible topic structures is exponentially large
  • Approximate the posterior distribution
  • Topic modelling algorithms form an approximation of this posterior,
    • by adapting an alternative distribution over the latent topic structure to be close to the true posterior

Two approaches:

1. Sampling based!

Attempt to collect samples from the posterior to approximate it with an empirical distribution – Gibbs sampling!

2. Variational methods!

Deterministic alternative to sampling based methods

Posit a parametrised family of distributions over the hidden structure and then find the member of that family that is closest to the posterior

Summary

The user specifies that there are K distinct topics

Each of the K topics is drawn from a Dirichlet distribution with a uniform base distribution u and concentration parameter β:

  φ_k ~ Dir(βu)

The distribution over topics for each document: θ_d ~ Dir(αu)

  • Topic assignment: z_{d,n} ~ Discrete(θ_d)
  • Word: w_{d,n} ~ Discrete(φ_{z_{d,n}})


Gibbs Sampling

It generates samples from complex, high-dimensional probability distributions.

The algorithm for Gibbs sampling is as follows:

  1. Initialize each variable with an initial value.
  2. Choose a variable to update, say the i-th variable.
  3. Sample a new value for the i-th variable from its conditional distribution, given the current values of the other variables.
  4. Repeat steps 2 and 3 for a specified number of iterations or until convergence is achieved.
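
To make the four steps concrete, here is a toy Gibbs sampler (not LDA itself) for a standard bivariate normal with correlation rho, where each conditional distribution is univariate normal.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=10_000, burn_in=1_000):
    """Toy Gibbs sampler: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y."""
    x, y = 0.0, 0.0                       # 1. initialise each variable
    sd = math.sqrt(1 - rho ** 2)
    samples = []
    for i in range(n_iter):
        x = random.gauss(rho * y, sd)     # 2-3. sample x from its conditional
        y = random.gauss(rho * x, sd)     #      then y from its conditional
        if i >= burn_in:                  # 4. repeat, discarding early samples
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
```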

LDA

  • LDA is a probabilistic generative model

    • Each document is a distribution of topics
    • Each topic is a distribution of words
  • Sample a topic from a document-level topic distribution

    • That obeys Dirichlet distribution
  • Then sample a word according to the word distribution of this topic

    • Dirichlet distribution
  • Generate a document

  • Hence, LDA implicitly models document-level word co-occurrence patterns

  • The sparsity problem exacerbates performance issues

    • The limited contexts make it more difficult for topic models to identify the senses of ambiguous words in short documents.
  • Short documents like Tweet

    • Sparse words …
    • Time-sensitive
    • Lack of clear context because not much information; no formal structure
      • How words are related often measured through co-occurrence patterns
      • In short texts document-level co-occurrence patterns are difficult to capture
  • Massive volume of tweets;

    • Memory requirements
  • Real-time nature

Problems with training LDA on Twitter data

Why not LDA?

  • LDA needs to be trained on the entire data sets
    • Memory requirements for the model
  • LDA is trained and tested on a data set
    • Time-sensitive nature of Twitter

How to address this issue …?

  • Enrich the word co-occurrence information
    • To enrich the limited word co-occurrence information contained in a single short text,
  • Make larger texts by grouping short texts (tweets)
    • Grouping tweets by the authors
    • However, this aggregation method highly depends on the meta-information of each text,
      • which may not always be available for many kinds of short texts.
    • Another strategy models the similarity between texts to aggregate similar texts into a long pseudo-document
  • Explicit text similarity

Topic 4 - Network Analysis

L6 Graph-based Network Analysis

Graph Modelling

  • Capturing structural properties of social networks
  • Relationships formed between individuals
    • To identify clusters
    • Cliques and connected components of users
    • Centrality measures
  • Hashtags
    • Which hashtags are strongly connected

Centrality Measures: Find the influential users, find the centre of a graph.

Why graph?

  • By analyzing network data, we can ask many questions

    • Who is most important in a network?
    • Which way does information flow?
  • We can use graph analysis to answer questions like these

  • Note

    • Sample questions!
    • What are people talking about?
      How are they responding to a product?
    • The breadth of such analyses is huge and not covered fully

Graph Theory

Graph

  • Graphs are a way to formally represent a network or a set of interconnected objects

  • Nodes and edges

  • Unlike trees, there is no concept of a root node; in a graph there is no unique node known as the root

  • One node might be connected to five others!

  • No inherent concept of one-directional flow!

Edges

  • With direction or flow!
  • Without direction!

Direction

  • Origin to destination

Trees vs graphs

A tree is a set of nodes and edges. In a tree, there is a unique node which is known as root.

Terminology - Undirected graphs

  • u and v are adjacent if {u, v} is an edge,
    • e is called incident with u and v
    • u and v are called endpoints of {u, v}
  • Degree of Vertex (deg (V)):
    • the number of edges incident on a vertex.
    • A loop contributes twice to the degree

Terminology - Directed graphs

  • For the edge (u, v), u is adjacent to v OR v is adjacent from u,
  • u - initial vertex (origin)
  • v - terminal vertex (destination)
  • In-degree (deg−(u)): number of edges for which u is the terminal vertex
  • Out-degree (deg+(u)): number of edges for which u is the initial vertex

Incidence Matrix:

What are the maximum potential edges?

Undirected graph: (n*(n-1))/2

Directed graph: n*(n-1)

Edge density = number of edges / maximum number of edges possible
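
A quick sketch of the two formulas above (the node and edge counts are arbitrary).

```python
def edge_density(num_nodes, num_edges, directed=False):
    """Edge density = number of edges / maximum number of edges possible."""
    max_edges = num_nodes * (num_nodes - 1)        # directed case: n*(n-1)
    if not directed:
        max_edges //= 2                            # undirected case: n*(n-1)/2
    return num_edges / max_edges

print(edge_density(5, 4))                  # undirected: 4 / 10 = 0.4
print(edge_density(5, 4, directed=True))   # directed:   4 / 20 = 0.2
```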

Adjacency Matrix

  • There is an N × N matrix, where |V| = N,

  • the adjacency matrix (N × N)

  • This makes it easier to find subgraphs

  • When there are relatively few edges in the graph the adjacency matrix is a sparse matrix

Graph analysis in twitter

Twitter is a directed graph while Facebook is undirected

Why graph analysis?

  • By analyzing tweet data, we can ask many questions
  • Who is most important in a network?
  • How did the information flow?
  • How could we reach 50% of the graph?
  • Who is more influential?
  • What are people talking about?
  • How are they responding to a product?

Centrality

find who is important

  • Measures of importance in social networks are called centrality measures

Degree centrality

  • Who gets the most re-tweets?

    • Basically says who is most important in the network
  • In-degree: number of retweets of a user

  • Out-degree: number of retweets this particular user made

  • The degree centrality is a fundamental metric in network analysis.

  • It is used to directly quantify the number of nodes in the network that a given node is adjacent to.

  • for directed networks

    • There are variations of this metric that are used, where connections between nodes have a directionality.
  • In directed networks, it makes sense to talk about the following:

  • In-degree - For a given node, how many edges are incoming to the node.

  • Out-degree - For a given node, how many edges are outgoing from the node.

  • CD(v) = deg(v)
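
A small NetworkX sketch of in-/out-degree and degree centrality on a hypothetical retweet graph, using the convention that an edge points from the retweeter to the original author (so in-degree counts retweets received).

```python
import networkx as nx

# hypothetical edges: retweeter -> original author
G = nx.DiGraph([("bob", "alice"), ("carol", "alice"), ("carol", "bob"), ("dave", "alice")])

in_deg = dict(G.in_degree())          # retweets received per user
out_deg = dict(G.out_degree())        # retweets made per user
centrality = nx.degree_centrality(G)  # total degree normalised by (n - 1)

print(in_deg, out_deg, centrality)
```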

Centrality measures!

Designed to characterise

  • Functional role - what part does this node play in system dynamics?
  • Structural importance - how important is this node to the structural characteristics of the system?

In each of the following networks, X has higher centrality than Y according to a particular measure


  • “Who is the most important or central person in this network?”
    • There are many answers to this question, depending on what we mean by importance.
  • The power a person holds in the organization is inversely proportional to the number of keys on his keyring.
    • A janitor has keys to every office, and no power.
    • The CEO does not need a key: people always open the door for him.
  • Degree centrality of a vertex


Eigenvector centrality

  • Who is the most influential

  • In contrast to degree centrality

    • How important are these retweeters?
  • is a measure of the influence of a node

    • It assigns relative scores to all nodes in the network based on the concept that
    • connections to high-scoring nodes contribute more to the score of the node in
      question
      • than equal connections to low-scoring nodes.
  • Google’s PageRank is a variant of eigenvector centrality.

  • Using the adjacency matrix A

    • Ax = λx
    • there is a unique largest eigenvalue, which is real and positive
    • This greatest eigenvalue results in the desired centrality measure.
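
A minimal NetworkX sketch, shown on a small undirected toy graph for simplicity; for directed retweet graphs, PageRank (mentioned above as a variant) is often the more practical choice.

```python
import networkx as nx

G = nx.Graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")])

eig = nx.eigenvector_centrality(G)   # iteratively solves Ax = lambda x
pr = nx.pagerank(G)                  # damped variant, also well suited to directed graphs
print(eig)
print(pr)
```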

Betweenness Centrality/Closeness Centrality

Betweenness centrality measures the number of shortest paths on which the user lies, i.e., paths in which the user appears in the sequence of nodes.

  • It was introduced as a measure for quantifying the control of a human over the communication between other humans in a social network.
  • In this conception, vertices that have a high probability to occur on a randomly chosen shortest path between two randomly chosen vertices have a high betweenness.


Closeness Centrality: Definition

  • Closeness is based on the length of the average shortest path between a vertex and all vertices in the graph
  • Closeness Centrality

L7 Retweet Graph, Trends & Influencers

Information diffusion ….

tracing, understanding and predicting how a piece of information is spreading.

Information diffusion

  • In online communities, tracking information diffusion is useful for many applications,

    • such as early warning systems,
    • social bot and community detection,
    • user location prediction,
    • financial recommendations,
    • marketing campaign effectiveness,
    • political mobilization and protests
  • Twitter offers four possible actions to express interest in specific content:

    • favorite, reply, quote and retweet.
    • Replying or liking a tweet does not involve the spread of the content,
    • whereas quotes and retweets are actions used to
      • share information with a wider audience.
  • A retweet is often considered an endorsement, i.e., the user supports the original tweet’s content,

  • whereas quoting may be done in order to express a different idea

Hashtags&mentions

  • Hashtag - adding a “#” to the beginning of an unbroken word or phrase creates a hashtag.
    • When you use a hashtag in a Tweet, it becomes linked to all of the other
      Tweets that include it.
    • Including a hashtag gives your Tweet context and allows people to easily follow topics that they’re interested in.

@Mentions are used when talking to or about someone (the user account of a person, brand, group, etc.)

  • In marketing
  • Using hashtags helps a brand connect with what’s happening on Twitter.
  • When brands connect with what’s happening on Twitter, they see lifts across the marketing funnel, such as
    • +18% message association, +8% brand awareness, and +3% purchase intent

Twitter REST API

  • The user Tweet timeline endpoint is a REST endpoint that receives a single path parameter to indicate the desired user (by user ID).

    • The endpoint can return the 3,200 most recent
      • Tweets, Retweets, replies, and Quote Tweets posted by the user.
  • User mention timeline

    • The user mention timeline endpoint allows you to request Tweets
    • mentioning a specific Twitter user, for example,
      • if a Twitter account mentioned @TwitterDev within a Tweet
  • it is possible to collect a huge amount of information regarding tweets, accounts, users’ timelines and social networks (i.e., following and followers).

Interaction among users

  • In order to understand the connections among users, it is important to consider not only their social networks
    • but also, the way they interact, especially through retweets
  • the Twitter API does not provide complete information about retweets and their propagation paths.
    • More precisely, the only information carried by a retweet is the original user
  • To estimate retweet cascade graphs,
    • many strategies use social network information
    • (i.e., friends and followers) in conjunction with temporal information

Retweet Graph

  • a graph of users, where an edge

    • means that one of the users has retweeted a message of a different user.
  • retweet graph G = (V, E),

    • which is a graph of users that have participated in the discussion on a specific topic.
    • A directed edge e = (u, v) indicates that user v has retweeted a tweet of u.
    • Or e = (u, v) indicates that user u has retweeted a tweet of v.
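
A sketch of building a weighted retweet graph with NetworkX from hypothetical (retweeter, original author) pairs, using the first convention above: e = (u, v) means v has retweeted a tweet of u.

```python
import networkx as nx

# hypothetical stream of (retweeter, original_author) pairs extracted from tweets
retweets = [("u2", "u1"), ("u3", "u1"), ("u3", "u2"), ("u4", "u1"), ("u3", "u1")]

G = nx.DiGraph()
for retweeter, original in retweets:
    # edge (original, retweeter): the retweeter has retweeted a tweet of the original author
    if G.has_edge(original, retweeter):
        G[original][retweeter]["weight"] += 1
    else:
        G.add_edge(original, retweeter, weight=1)

print(G.number_of_nodes(), G.number_of_edges())
```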


Count the number of links to a node in the network

  • the number of directed edges with the node as source (or destination)
  • In-degree - number of retweets of a user
  • Out-degree - number of retweets this particular node (user) made

Observe the retweet graph at the time instances t = 0, 1, 2, …,

  • where either a new node or a new edge was added to the graph,
  • G_t = (V_t, E_t) is the retweet graph at time t

Issues in building interaction graph

  • prior studies exploited the fact that users tend to interact more often with newer tweets,
    • and a user is more likely to retweet the last friend who retweeted content.
  • However, this approach is no longer a reliable way of estimating retweet graphs,
  • Since, Twitter does not show content based on
    • simple reverse chronological order,
    • but according to user interests, trending topics and interactions
  • Fetching all the required social network information is expensive
    • due to the Twitter API rate limits, the time required to collect the list of friends and followers is
    • six times greater, on average, than downloading the user’s timeline.

Some findings

  • analyse the spread mechanics of content through hashtag use
  • and derive probabilities that users adopt a hashtag.
  • Hashtags tend to travel to more distant parts of the network and
  • URLs travel shorter distances.

Random graph model

  • Super-star random graph model for the giant component of a retweet graph.
  • users with many retweets have a higher chance to be retweeted,
  • however, there is also a super- star node that receives a new retweet at each step with a positive probability.
  • Trending topics
    • Ongoing topics that become suddenly extremely popular
  • detecting different types of trends, for instance
    • detecting emergencies,
    • earthquakes,
    • diseases or important events in sports.
  • An important part of trending behaviour in social media is
  • the way these trends progress through the network.

Two options

  • Content of the tweets discussing a topic

    • How do we find this?
  • Underlying networks describing the social ties between users of
    Twitter

    • a graph of users, where an edge means that one of the users has retweeted a message of a different user.
  • In both cases, we could ask

    • How big or small is the interaction network compared to the followers’ network?

    • What kind of information goes through the network?

      • 85% of content is/was News!

Largest connected component LCC

  • LCC refers to a maximal set of nodes
  • such that you can move from any node to any other node
  • in this set by only moving between adjacent nodes of the graph.

All components of a graph can be found by looping through its vertices,

starting a new breadth-first or depth-first search whenever the loop reaches a vertex that has not already been included in a previously found component.
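
A short NetworkX sketch of exactly this: enumerate the components and keep the largest; for directed graphs, `nx.weakly_connected_components` plays the same role.

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (8, 6)])

components = list(nx.connected_components(G))   # BFS/DFS over unvisited vertices
largest = max(components, key=len)              # node set of the LCC
lcc = G.subgraph(largest)

print(len(components), sorted(largest), lcc.number_of_edges())
```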

Graph density

  • represents the ratio between the edges present in a graph and the maximum number of edges that the graph can contain.
  • Conceptually, it provides an idea of how dense a graph is in terms of edge connectivity.
  • In this work, density is computed as |E|/|V|

The size of the largest connected component (LCC) and its density are the most informative characteristics for predicting the trend peak in Twitter.

Information diffusion & influencers

Issue with retweet graph

  • users might be exposed and influenced by a piece of information by multiple users, hence forming multiple influence paths

  • When a message arrives that is a retweet, every friend that has (re)tweeted at an earlier point in time has to be considered as a potential influencer

  • there is no agreement on the minimum number of followers needed to be regarded as an “influencer”

  • In fact, in marketing, they talk about

    • Micro-influencers

Influence paths express the relationship of “who was influenced by whom”.

The set of influence paths form a social graph, that share a common root (a single user who first seeded a tweet). Influence path is referred as “information cascade”. A cascade is formed when users forward the same original message from a user that we call the root user.

Information cascade model

  • how information is being propagated from user to user from the stream of messages and the social graph.

  • Nodes of the cascade represent users (user nodes) of the social network that got “influenced” by the root or another user.

  • Edges of the cascade represent edges of the social graph over which influence actually spread.

An “influencer” in the case of Twitter is the so called “friend” that exposes information to his/her followers and

exerts influence on them in such a way that they forward this piece of information.

However, when real diffusion data is missing,

we can derive these influence paths from the social connections among users.

Absolute Interaction strength


Retweet distribution

  • The retweet distribution given the time delay between the retweet action date and the original tweet posting time
    • for 16,304 cascades
  • Temporal dynamics of the retweets after the respective roottime
  • Showing a decreasing trend,
    • as the highest number of interactions occurred
    • soon after roottime (the original tweet creation date).

Weighted information strength


Approach - Generating Retweet Cascade Graphs


No interaction


In addition

  • when there are no available interactions by a user u, and thus
  • no IS values between u and any other user,
  • An alternative is to find a link from u to another user in the cascade
  • collect the user’s friend list by using the Twitter API, and
  • every user’s friend that has retweeted at an earlier point in time is considered as a potential influencer
  • To identify the influencer that more likely spread the tweet to user u,
  • consider the most recent influencer, i.e., u is linked to the last friend that retweeted the message.
  • Users still remaining without an edge after this second step are denoted as sparse nodes (SN).

How do you find influencer nodes and communities?

  • How could we find important nodes?
    • Influencers?
  • How can we find the information paths?
    • What measures might you use? Centrality, degrees.
  • What alternative mechanisms to weight graphs?

L8 Network Analysis - Case studies in Health Communities

Case studies

How online communities of people with long-term conditions function & evolve: network analysis of the structure and dynamics of the Asthma UK and British Lung Foundation online communities

Problems

  • We have seen

    • People express themselves through social media

    • Huge amount of data

  • People suffering from mental health issues

    • Silent suffering!
  • Can we create a self-management or self-diagnosis tool

    • A tool helping them to control their situation

    • A tool nudging them to get help!!

  • To start with

    • How prevalent are mental health issues in society?
  • Can we mine network structure of social media to understand how communities support mental health issues?

    • The network structure of social media data provides insights into the support given by society on mental health issues

Social Support

Social Support is an exchange of resources between two individuals

  • perceived by the provider or the recipient to be intended

  • to enhance the well-being of the recipient

    • e.g.,

      • Facebook interaction

      • RTs …

How to extract signatures of perceived social support?

Post-reply Network

Example -> StackOverflow

It is not like Twitter; it forms because its users share the same interests.

A community expertise network.

Graph modelling: what is an interaction graph?

User interaction graph

Tie

  • A tie connects a pair of users/actors by one or more relations

    • Sharing information, financial or psychological support

    • One relation or multiple set of relations

    • Vary in content, direction & strength

  • Look at the actual ties between users instead of message-level interactions

Structural prestige in online communities

  • A thread

    • How many people a user replied to (out-degree)
    • How many people replied to the user (in-degree)
  • In directed networks

    • People who send many responses/replies are considered to be prestigious, or a person with knowledge

Size of a node

  • the size of the node depends on the number of replies sent by the user
  • The more replies, the larger the node

How to create an interaction graph?

  • the users who post and the corresponding reply users

  • A Pandas DataFrame contains two columns of node names

    • posts_author, comments_author.
  • use nx.from_pandas_edgelist() to convert the DataFrame to a network graph.

    • NetworkX (nx) is a Python software package,

    • used to create and operate complex networks,

    • and to learn the structure, dynamics and functions of complex networks.

  • To distinguish the importance of each user who posts,

    • we set the size of these users’ nodes to twice their node degree,
      colour the posting users’ nodes orange,

    • and colour the repliers’ nodes light grey and the directed edges light blue.
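
A runnable sketch of the steps above with made-up post/reply authors; the column names and the replier-to-poster edge direction are assumptions.

```python
import pandas as pd
import networkx as nx

# hypothetical post/reply author pairs
df = pd.DataFrame({
    "posts_author":    ["anna", "anna", "ben",  "cara"],
    "comments_author": ["ben",  "cara", "cara", "anna"],
})

# directed edge from the replier to the original poster
G = nx.from_pandas_edgelist(df, source="comments_author", target="posts_author",
                            create_using=nx.DiGraph())

# node size set to twice the node degree, as in the notes
node_sizes = {node: 2 * degree for node, degree in G.degree()}
print(node_sizes)
```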

Temporal activity patterns

  • Let us study how the community thrives
    • How does it function and evolve over time?
  • Basically we are answering research questions
    • like “what is the basic structure of such online communities and how do they function and evolve over time?”

Degree distributions

  • Look at the distribution of degrees,

  • Or the amount of edges a node has

  • Across all nodes in a graph

  • Top 1% of nodes

    • in terms of degrees

    • most interactive, having established edges with a lot of other users by exchanging messages

    • Known as super users

Activity analysis

  • How are users engaging, and how does the community thrive?

    • Does posting activity follow a time pattern?
  • How much activity is happening on a daily or weekly basis in a particular community

    • Number of messages exchanged in a community across the whole life cycle of data

    • how users are engaging with a community

  • Cumulative frequencies of activity

    • Number of posts/replies per week

To understand the behaviour of the community

  • the trend tends to be linear,

    • which indicates that the number of new replies per week remains stable.
  • SuicideWatch averages 211,605 posts weekly and PTSD averages 1,980 posts

  • This shows that the average weekly posting volume of SuicideWatch community users is almost 100 times that of PTSD.

    • There are 227,307 users in the SuicideWatch community, while PTSD has only 50,032 users.

Open question – how do we distinguish two communities?

Modularity Optimization: Modularity is a measure of the degree to which nodes in a network are connected within their own community compared to the connections between different communities. Modularity optimization involves identifying communities that maximize modularity, or in other words, communities that have a high degree of internal connections and a low degree of external connections.

Girvan-Newman Algorithm: This algorithm involves iteratively removing edges from the network in order of their “betweenness centrality,” which is a measure of how often a given edge lies on the shortest path between two nodes. By removing edges in this way, the algorithm gradually breaks the network into smaller and smaller communities.

Label Propagation: This method involves assigning an initial label to each node in the network and then iteratively updating the labels based on the labels of neighboring nodes. Over time, nodes tend to cluster into groups with similar labels, forming communities.

Spectral Clustering: This method involves using the eigenvalues and eigenvectors of the network’s adjacency matrix to identify communities. By projecting the network onto a lower-dimensional space, spectral clustering can often separate nodes into distinct communities.

How did we study the behaviour of the two communities & what did we find?

  • For each week

  • Compute average post

    • Look at the total posts by all users

    • Divided by total number of unique users

  • How users are engaging with the community?

    • A continuous engagement is good for the vitality of community.

    • Do these communities drive enough engagement and activity to sustain themselves?

Super users

  • A small minority of users

    • Responsible for a high proportion of posting activity and thus support

    • Functioning of communities

  • 5% of users generate

    • Over 70% of content
  • How do we study the role of super users?

    • Sensitivity analysis

How to find the super user?

  • For each user

    • Count the number of posts (A) and

    • The number of replies (B)

    • The number of total activity (A + B)

  • Rank the user in terms of respective frequencies
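
A small sketch of ranking users by total activity (posts + replies) and taking a top slice; the example author lists and the cut-off are illustrative.

```python
from collections import Counter

def super_users(post_authors, reply_authors, top_fraction=0.05):
    """Rank users by total activity A + B and return the top fraction."""
    activity = Counter(post_authors) + Counter(reply_authors)   # A + B per user
    ranked = activity.most_common()
    cutoff = max(1, int(len(ranked) * top_fraction))
    return ranked[:cutoff]

posts = ["anna", "anna", "ben", "cara", "anna"]    # author of each post
replies = ["ben", "ben", "anna", "dave", "ben"]    # author of each reply
print(super_users(posts, replies, top_fraction=0.5))   # -> [('anna', 4), ('ben', 4)]
```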

Connected component

  • A connected component of an undirected graph is a maximal set of nodes such that
    • each pair of nodes is connected by a path.
  • Directed graphs have weakly and strongly connected components.
  • Two vertices are in the same weakly connected
    component
    • if they are connected by a path, where paths are allowed to go either way along any edge.
  • The weakly connected components correspond closely to the concept of connected component in undirected graphs and the typical situation is similar
    • there is usually one large weakly connected component plus other small ones.

Largest connected component

  • The largest connected component of a

    • graph G(V, E) is the largest possible subgraph

      • G_L(V_L, E_L) of G,

      • such that each node in G_L has at least one valid path to every other node in G_L

  • LCC gives us the subset of users

    • Who form a cohesive community
  • Importance of super users on LCC

    • By removing them and studying the cohesion

Community resilience

Temporal Analysis

  • Characteristics of LCC on a weekly basis

  • Focused and cohesive nature of interactions

    • By looking at the fraction of users belonging to the LCC.
  • Our aim is to study

    • community resilience
  • First, how cohesive is the community?
  • For each weekly graph G

    • Compute the LCC

    • That is, all nodes in the LCC have at least one path to each other

  • Compute the fraction N_k/N

    • where N_k is the number of nodes in the LCC and N is the total number of nodes

Fragility of the community

  • Is the conversation network held up by

    • a more or less uniform contribution of nodes,

    • or

    • is there a skew in the responsibility of nodes?

  • Sensitivity analysis methods

    • Which measures the network’s capacity to diffuse information as you remove nodes based on a certain property
  • Importance of super users

Sensitivity analysis

  • the targeted removal of nodes (users) starting from the most connected nodes.
  • represents the size of the largest component as a percentage of the network size.
  • Specifically, it illustrates the key effects of the superusers on the website from another perspective.

Rich club effect

  • that a few important nodes (users) show stronger and closer connection with each other,

    • and constitute a structural core and functional hub.
  • the rich-club coefficient

    • is the ratio of the actual number of edges of nodes with order greater than k
    • to the number of potential edges of each order k

φ(k) = 2·E_{>k} / (N_{>k} · (N_{>k} − 1)), where N_{>k} is the number of nodes with degree greater than k and E_{>k} is the number of edges among them.

  • the coefficient continues to be lower than 1,
    • indicating that the amount of interaction between superusers is not high,
    • and the amount of interaction between most non-superusers is also not high.
  • the interactions between the superusers and non-superusers are very high,
    • indicating that superusers are more inclined to communicate with users with fewer interactive connections.
  • How do we explain this
    • there are a large number of users with purposeful questions on the website and a small number of experts in the field.
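
A brief NetworkX sketch of computing the (unnormalised) rich-club coefficient per degree k on a random toy graph; real analyses usually compare against randomised versions of the same network.

```python
import networkx as nx

G = nx.erdos_renyi_graph(n=200, p=0.05, seed=42)   # toy stand-in for an interaction graph

# for each degree k: edges among nodes with degree > k divided by the
# maximum possible number of edges among those nodes
rc = nx.rich_club_coefficient(G, normalized=False)
print({k: round(v, 3) for k, v in sorted(rc.items())[:5]})
```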

Z-score

  • We have seen core users and their relationship with other users from the graph

  • We do not know whether core users

    • Tend to ask for help (post more)

    • Help others (reply more)

  • Look at a thread!

  • To find that out, let us look at the z-score:
    z = (x − mean) / sd

Emotion Analysis

L9

Sentiment analysis & variants

Variants

  • Sentiment classification
    • whether a piece of text is positive, negative or neutral
    • Degree of intensity
      • [-100,100]
  • Opinion analysis
    • Determining from text, the speaker’s opinion and target of the opinion
  • Stance
    • Author of text is in favour of, against, or neutral towards a proposition or target
    • For example, Brexit agreement
      • Are people supportive?
  • Emotion
    • What are the emotion expressed in the text?

Sentiment vs Stance

  • Target:
    • Legalization of Abortion
  • Tweet
    • The pregnant are more than walking incubators. They have rights too!
    • In favour of the target
    • Target - Pro-life movement
      • ??
  • Target
    • Donald Trump
  • Tweet
    • Donald Trump has some strengths and some weakness
      • neutral

Stance detection

  • Is the task of automatically determining from text
    • whether the author of the text is in favour of, against, or neutral
    • toward a proposition or target
  • Target
    • Person, organization, government policy, a movement, a product
    • E.g., infer from former Prime Minister Boris Johnson’s speeches that he is in favour of Brexit
    • E.g., analysing tweets identify people in favour of leadership change

Aspect Based Sentiment Analysis

  • A sentence contains one or more entities,
    • each of which has a different polarity of emotion.
  • For example, give a comment like
    • “Great food but the service is dreadful!”,
    • the emotional polarity of entity “food” is “positive”
    • while the emotional polarity of entity “service” is “negative”.
  • Compared to sentence level sentiment analysis, ABSA can present
    • users with more precise and fine-grained sentiment information of entities
  • You identify an aspect and the sentiment towards that aspect

Sentiment classification is limited

  • Language serves social and interpersonal functions.
    • Affective meaning is key for human interaction and a prominent characteristic of language use.
  • This extends beyond
    • opinions vs. factual or polarity distinctions
    • into multiple phenomena:
      • emotion, mood, personality, attitude, credibility, volition, veracity, friendliness, etc.
  • Emotion:
    • angry, sad, joyful, fearful, …
  • recognition, characterization, or generation of affect states,
    • involves analysis of affect-related conditions, experiences, and activities.

Textual emotion

  • We analyse the text and detect the emotion
    • expressed by the author or
    • the emotion potentially felt by the reader
  • Linguistic sensing of affective states can be used for
    • Understanding social issues expressed through social media
  • Researchers in psychological science believe that
    • individuals have internal mechanisms for a limited collection of responses, usually
    • happy, sad, anger, disgust, and fear

There are 6 emotion categories that are widely used to describe

humans’ basic emotions, based on facial expression:

anger, disgust, fear, happiness, sadness and surprise.

  • Categorical theories
  • Emotions are discretely and differently constructed and
  • all humans are thought to have an innate set of basic emotions
  • that are cross-culturally recognisable

OCC model

22 emotions = 6 Paul Ekman emotions + 16 additional emotions

Criticism

  • In the categorical approach, emotional states are restricted to a limited number of distinct types and
  • it can be difficult to resolve a complex emotional situation or mixed emotions.
  • Appraisal theory
  • It contains componential emotion models based on the theory of appraisal.
  • Appraisal theory describes how different emotions, in various participants and on different times, can arise from the same event

Text emotion detection

motivation

  • because of how naturally vague and ambiguous human language is,
    • the emotion detection can be highly “context-sensitive and complex”
  • Emotion analysis is a convoluted task, even for human beings,
    • due to the various cultures, gender, and context of people who authored the texts.
    • The task will be much easier when emotion is expressed explicitly, but in reality,
    • the majority of texts are subtle, ambiguous, and some words have more than one meaning, and
    • more than one word expresses the same emotions, and, in addition, some emotions can exist simultaneously
  • Don’t u just HATE it when u cannot find something that u know you just saw like
    10 min ago!
  • By analysing the horse racing comments,
  • the model can learn about the information of winning horse and
  • would be able to give a reasonable prediction based on the emotion.

emotion classification

  • In the case of sentiment analysis, this task can be tackled using
    • lexicon-based methods, machine learning, or a rule-based approach
  • In emotion recognition task, the 4 most common approaches are
    • Keyword-based detection
      • Seed opinion words and find synonyms & antonyms in WordNet
      • WordNet
  • Lexical affinity
  • hybrid
  • Learning based detection

Lexicon

  • Lexicons are linguistic tools for automated analysis of text
  • Most common
    • Simple list of terms associated to a certain class of interest
    • Classification by counting
    • Terms can be weighted according to their strength of association with a given class

Lexicon based approaches?

  • “I love horror books”
  • f(love) × 1.0 + f(horror) × 0.3 + f(books) × 0.5 = 1.8 for positive
  • f(love) × 0.0 + f(horror) × 0.7 + f(books) × 0.5 = 1.2 for negative
  • Decision function
    • Classify one with maximum value
  • Transparency
    • Each prediction can be explained
      • Analysing terms that were present in the text
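
A runnable version of the weighted-counting example above; the per-class weights mirror the numbers in the bullets and are otherwise arbitrary.

```python
# class-association weights for each lexicon term (values taken from the example above)
lexicon = {
    "love":   {"positive": 1.0, "negative": 0.0},
    "horror": {"positive": 0.3, "negative": 0.7},
    "books":  {"positive": 0.5, "negative": 0.5},
}

def classify(tokens):
    """Sum term frequency x class weight for each class and pick the maximum."""
    scores = {"positive": 0.0, "negative": 0.0}
    for token in tokens:
        for cls, weight in lexicon.get(token, {}).items():
            scores[cls] += weight          # each occurrence contributes its weight
    return max(scores, key=scores.get), scores

print(classify("i love horror books".split()))
# -> ('positive', {'positive': 1.8, 'negative': 1.2})
```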

BERT vs GPT

Mock Paper

2022

(a) Assume that the BBC recruited you to develop a social media application. The BBC is interested in knowing their readers’ feelings on the news and other events covered by the broadcaster. Your job is to develop a classifier. In this context, answer the following questions:

(i)

Your first task is to create Twitter datasets with positive and negative statements so that they can be used for estimating the probabilities for words in the respective classes. [Hint: Assuming that you have a social media crawler, discuss how you will automatically label positive and negative tweets; how will you avoid spurious data]

use hashtag based data collection

1. Data collection: Use a social media crawler to collect tweets that mention the BBC or are in response to BBC’s tweets. This could be done using Twitter’s API to search for tweets containing specific keywords, hashtags, or mentions related to BBC’s news and events.

2. Pre-process: Clean the collected tweets by removing irrelevant information such as URLs, mentions, and special characters. Convert all text to lowercase and tokenize the words for easier analysis. Ignore this step.

3. Automatic labeling: Loop over the tweets, using a lexicon approach to compute the overall score of each tweet. If a tweet has been annotated with a hashtag, use the NRC hashtag lexicon approach to label the tweet; if it has an emoji at the end, use an emoticon-based approach to label the tweet. Then calculate the score for each feature (word/hashtag/emoticon) and select the label based on the overall score.


4. Avoid spurious data. Pre-defined relevant hashtags: Compile a list of relevant and meaningful hashtags associated with the BBC or specific news and events covered by the broadcaster. Only collect and analyze tweets containing these pre-defined hashtags to ensure high-quality data. Noise filtering: Filter out tweets that are ambiguous, contain sarcasm, or are irrelevant to the BBC’s content. This can be done using advanced NLP techniques or by setting a minimum sentiment score threshold to exclude borderline cases.

(ii)

Your second task is to develop a lexicon-based automatic sentiment analysis method, which assigns sentiment intensity between [-100,100].Describe an algorithm that also uses the dataset you created in (i). [Hint: identify a suitable lexicon; identify linguistic cases you may handle; specify a scoring method]

To develop a lexicon-based automatic sentiment analysis method that assigns sentiment intensity between [-100, 100], follow these steps:

  1. Choose a suitable lexicon: Select a pre-built sentiment lexicon that provides sentiment scores for words. Examples of such lexicons include SentiWordNet, AFINN, or VADER. These lexicons typically assign sentiment scores to words on a scale of -1 to 1 or -5 to 5.

  2. Preprocessing: Clean the dataset created in step (i) by removing irrelevant information like URLs, mentions, and special characters. Convert all text to lowercase and tokenize the words for easier analysis.

  3. Handling linguistic cases: Address various linguistic cases to improve sentiment analysis:

    a. Negations: Identify negation words (e.g., not, never, isn’t) and modify the sentiment score of the words that follow. For example, you can reverse or reduce the sentiment score of the word following the negation.

    b. Intensifiers: Identify intensifier words (e.g., very, extremely, really) and adjust the sentiment score of the following word accordingly. For example, you can multiply the sentiment score by a factor (e.g., 1.5 or 2) based on the intensity of the intensifier.

    c. Diminishers: Identify diminisher words (e.g., slightly, barely, hardly) and adjust the sentiment score of the following word accordingly. For example, you can multiply the sentiment score by a factor (e.g., 0.5) based on the diminishing effect.

  4. Scoring method: Implement a scoring method to calculate the sentiment intensity of each tweet using the lexicon and linguistic cases:

    a. Initialize a sentiment score variable to 0 for each tweet.

    b. Iterate through the words in the tweet, and for each word, check if it has a sentiment score in the lexicon.

    c. If the word has a sentiment score, adjust the score based on the linguistic cases (negations, intensifiers, diminishers) if applicable.

    d. Add the sentiment score of the word to the tweet’s sentiment score.

    e. After processing all words, normalize the tweet’s sentiment score to fit the range of [-100, 100]. For example, if the lexicon’s sentiment scores range from -5 to 5, multiply the tweet’s sentiment score by 20.

    f. Assign the normalized sentiment score to the tweet as its sentiment intensity.

  5. Evaluation: Compare the sentiment intensity assigned by the lexicon-based method with the labels in the dataset created in step (i). Calculate performance metrics such as accuracy, precision, recall, and F1-score to evaluate the effectiveness of the lexicon-based sentiment analysis method.

By following these steps, you can develop a lexicon-based automatic sentiment analysis method that assigns sentiment intensity between [-100, 100] and uses the dataset created in the previous task.
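
A minimal sketch of steps 3-4, assuming an AFINN-style lexicon in [-5, 5]; the tiny lexicon, the modifier factors and the ×20 normalisation are illustrative assumptions, not the required answer.

```python
# illustrative AFINN-style scores in [-5, 5] and modifier lists (all assumptions)
lexicon = {"good": 3, "great": 4, "bad": -3, "dreadful": -4}
negations = {"not", "never", "isn't", "don't"}
intensifiers = {"very": 1.5, "extremely": 2.0}
diminishers = {"slightly": 0.5, "barely": 0.3}

def sentiment_intensity(tokens):
    """Lexicon score with negation/intensifier/diminisher handling, mapped to [-100, 100]."""
    score, modifier, negate = 0.0, 1.0, False
    for token in tokens:
        if token in negations:
            negate = True
        elif token in intensifiers:
            modifier = intensifiers[token]
        elif token in diminishers:
            modifier = diminishers[token]
        elif token in lexicon:
            value = lexicon[token] * modifier
            score += -value if negate else value
            modifier, negate = 1.0, False        # modifiers apply only to the next scored word
    return max(-100.0, min(100.0, score * 20))   # scale the [-5, 5] range to [-100, 100]

print(sentiment_intensity("the service is not very good".split()))   # -> -90.0
```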

(iii)

Now that you created a sentiment analysis method, you want to verify the method’s validity from a user’s perspective. Design a scalable user-based study to ensure your sentiment scoring method is appropriate.

To design a scalable user-based study to ensure the sentiment scoring method’s appropriateness, follow these steps:

  1. Select a representative sample: Randomly sample a subset of tweets from the dataset created in the first task. Make sure the sample includes a balanced number of positive and negative tweets, as well as a diverse range of topics covered by the BBC.
  2. Prepare the evaluation interface: Develop a user-friendly interface for the study participants. The interface should display a tweet and its corresponding sentiment score, calculated using the lexicon-based sentiment analysis method. Participants should be able to rate the sentiment score’s appropriateness on a scale (e.g., 1 to 5, with 1 being “strongly disagree” and 5 being “strongly agree”).
  3. Recruit participants: Recruit a diverse group of participants to ensure a broad range of perspectives. You can use platforms like Amazon Mechanical Turk, Prolific, or other crowdsourcing services to recruit participants on a large scale.
  4. Training and instructions: Provide clear instructions and examples to participants on how to evaluate the sentiment scores. Briefly explain the concept of sentiment analysis, the scoring range of [-100, 100], and the evaluation scale. You may also provide examples of tweets with appropriate and inappropriate sentiment scores to help participants understand the task better.
  5. Evaluation process: Ask participants to evaluate the sentiment scores of the sampled tweets using the provided interface. Encourage them to consider the context of the tweet and the sentiment score’s appropriateness based on the tweet’s content.
  6. Collect user feedback: Allow participants to provide qualitative feedback on the sentiment scoring method, highlighting any issues or suggestions for improvement. This feedback can help identify potential areas of refinement for the sentiment analysis method.
  7. Analyze results: After collecting the evaluations, calculate the average appropriateness score for each tweet’s sentiment score. High average scores indicate that the sentiment scoring method is appropriate from the user’s perspective. Analyze the qualitative feedback to identify common themes and potential areas for improvement.
  8. Iterate and improve: Based on the study results, refine the sentiment analysis method to address identified issues and incorporate user feedback. Repeat the user-based study with the updated method to evaluate its effectiveness iteratively.

By designing and conducting a scalable user-based study, you can ensure that the sentiment scoring method is appropriate from the user’s perspective and make data-driven improvements to enhance its accuracy and effectiveness.

(a) Create a vector representation for the following short text. Identify and remove potential stop words. “@AlanStainer @takeitev It’s mad isn’t it. In the UK there are 8k petrol stations with multiple pumps and 25k chargers (increasing by 300 pm). They do know the climate emergency is now right? Not in 30 years’ time, Just asking”

Stopwords removed: “the”, “in”, “and”, “it”, “there”, “are”, “with”, “by”, “they”, “do”, “is”, “now”, “not”, “just” (plus the @-mentions)

[“it’s”, “mad”, “isn’t”, “uk”, “8k”, “petrol”, “stations”, “multiple”, “pumps”, “25k”, “chargers”, “increasing”, “300”, “pm”, “know”, “climate”, “emergency”, “right”, “30”, “years’”, “time”, “asking”]

(b) Create all biterms from the following text, “in the UK there are 8k petrol stations with multiple pumps and 25k chargers”

(“in”, “the”) (“in”, “UK”) (“in”, “there”) … (“pumps”, “and”) (“pumps”, “25k”) (“pumps”, “chargers”) … (“and”, “25k”) (“and”, “chargers”) … (“25k”, “chargers”)
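
The biterm set can be generated directly with `itertools.combinations`, which enumerates every unordered pair of tokens in the short text (a sketch, matching the pairs listed above).

```python
from itertools import combinations

tokens = "in the UK there are 8k petrol stations with multiple pumps and 25k chargers".split()

# a biterm is any unordered pair of distinct word positions co-occurring in the same short text
biterms = list(combinations(tokens, 2))
print(len(biterms))    # 14 tokens -> 14 * 13 / 2 = 91 biterms
print(biterms[:3])     # [('in', 'the'), ('in', 'UK'), ('in', 'there')]
```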

(c) Assume you have developed a topic model on a collection with your university communications for the last academic year. Design a user-centred experiment to evaluate the interpretability of the model. [Hint: design tasks and justify, selection of subjects, number of users, what will you measure, how do you prove the results]

Designing a user-centered experiment to evaluate the interpretability of a topic model involves several components, including selecting subjects, designing tasks, determining the number of users, and measuring performance. Here’s an outline for such an experiment:

  1. Define the goal: The primary goal of the experiment is to evaluate the interpretability of the topic model in terms of the coherence and relevance of the identified topics within the context of university communications.

  2. Selection of subjects: Choose participants who have some familiarity with the university environment, such as students, faculty, and administrative staff. This will ensure that they can understand and assess the relevance of the topics generated by the model.

  3. Number of users: To ensure the results are reliable and generalizable, aim for a diverse sample of participants. A sample size of 30-50 participants is typically considered sufficient for user-centered experiments, but the optimal number may vary depending on factors such as the complexity of the tasks and the desired statistical power.

  4. Design tasks: Create tasks that help assess the interpretability of the topics generated by the model. Example tasks include:

    a. Topic labeling: Ask participants to assign meaningful labels to a given set of topics. This assesses whether users can understand and make sense of the topics generated by the model.

    b. Topic ranking: Ask participants to rank a list of topics based on their relevance or importance to the university’s communications. This helps evaluate the model’s ability to identify meaningful and relevant topics.

    c. Document-topic assignment: Provide participants with a set of documents and ask them to assign the most relevant topic from the model to each document. This task assesses whether users can effectively map the generated topics to real-world documents.

  5. Measurement: Collect quantitative and qualitative data to evaluate the interpretability of the model.

    a. Quantitative measures: Calculate agreement scores (e.g., Fleiss’ Kappa or Cohen’s Kappa) to measure the consistency among participants in terms of topic labeling, ranking, and document-topic assignments (a short computation sketch follows this answer).

    b. Qualitative measures: Collect subjective feedback from participants about the coherence, relevance, and overall interpretability of the topics. This can be done through open-ended questions, interviews, or questionnaires.

  6. Analyze results: Analyze the quantitative and qualitative data to assess the interpretability of the model. High agreement scores and positive feedback from participants would indicate that the model generates interpretable topics.

  7. Prove the results: To prove the results, compare the performance of the topic model with alternative models or baseline approaches (e.g., LDA, NMF). Conduct a similar user-centered experiment for the alternative models and compare the agreement scores and subjective feedback. A higher performance of the developed topic model compared to the alternatives would provide evidence for its interpretability.

By carefully designing and executing a user-centered experiment, you can evaluate the interpretability of a topic model in the context of university communications, ensuring that the model generates meaningful and relevant topics for the users.
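To make the quantitative measure in step 5a concrete, here is a minimal sketch of computing inter-rater agreement for the topic-labeling task with scikit-learn’s Cohen’s kappa; the rating data shown is invented purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: two participants assign one of three candidate labels
# ("teaching", "research", "admin") to the same ten topics.
rater_a = ["teaching", "research", "admin", "teaching", "research",
           "admin", "teaching", "research", "teaching", "admin"]
rater_b = ["teaching", "research", "admin", "research", "research",
           "admin", "teaching", "admin", "teaching", "admin"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")   # values near 1 indicate strong agreement
```

For more than two raters, Fleiss’ kappa (available, for example, in `statsmodels.stats.inter_rater`) would be the analogous measure.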

(d) You have collected tweets and newspaper articles from Scotland for the last month.

Describe a method to develop topic models from these datasets. [Hint identify issues in dealing with such heterogenous datasets; how would you handle such issues]

- Prepare and pre-process the tweet and newspaper article data (cleaning, tokenization, stop-word removal, lemmatization).
- Create bigram and trigram models, then use the bigram/trigram-augmented data to build the dictionary that will be used for model training.
- Use the dictionary to create the corpus.
- Run topic modelling over the processed corpus with an LDA model.
- Evaluate the model with different measures, including KL divergence, perplexity, and coherence; this step also helps to determine the number of topics.
- Update the model parameters based on the evaluation results.
- Visualize the model: display the topic keywords as a chart or a word cloud. (A minimal code sketch of this pipeline follows the heterogeneity notes below.)

Handling heterogeneity: Tweets and newspaper articles have different lengths, styles, and contexts. To address these issues:

a. Length normalization: To mitigate the differences in length between tweets and newspaper articles, consider using techniques like document sampling or text chunking. For instance, divide newspaper articles into smaller chunks of approximately the same size as tweets.

b. Text representation: Use a suitable text representation method that can capture the context and semantics of both short and long texts. Techniques like TF-IDF, word embeddings (e.g., Word2Vec, GloVe), or even more advanced methods like BERT embeddings can be used.

c. Combining datasets: Combine the preprocessed tweets and newspaper article chunks into a single dataset to build a unified topic model.
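A minimal gensim sketch of the pipeline above, assuming `tweets` and `article_chunks` are lists of already pre-processed token lists (with long articles split into tweet-sized chunks); the parameter values are illustrative:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from gensim.models.coherencemodel import CoherenceModel

# tweets and article_chunks are assumed to be lists of token lists
texts = tweets + article_chunks

# Add frequent bigrams to the token lists
bigram = Phrases(texts, min_count=5, threshold=10)
texts = [bigram[doc] for doc in texts]

dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare / very common terms
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=20, passes=10, random_state=42)

coherence = CoherenceModel(model=lda, texts=texts,
                           dictionary=dictionary, coherence="c_v").get_coherence()
print(f"c_v coherence: {coherence:.3f}")
print(lda.show_topics(num_topics=5, num_words=8))
```

Repeating this for a range of `num_topics` values and comparing the coherence scores corresponds to the evaluation step above and helps choose the number of topics.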

3. Applying clustering (e.g., single-pass clustering) to a Twitter data stream will create groups of similar tweets of varying sizes. Design an algorithm to detect events from such groups. Specifically, answer the following questions.

(i) What role do entities play in detecting events? How would you reduce the cost of detecting entities?

Entities can include named entities like people, organizations, locations, or other relevant terms specific to an event. By identifying and tracking entities within the tweet clusters, we can recognize emerging events and monitor their development. Entities can help:

  1. Identify the key components of an event: Entities can represent the primary subjects or objects related to an event, enabling us to understand the event’s main focus.
  2. Differentiate between events: Entities can help distinguish between different events by providing context-specific information, which allows us to separate events with similar keywords but different contexts.
  3. Track the progression of events: Monitoring the frequency and co-occurrence of entities over time can provide insights into how an event is evolving and help identify new developments or trends.

Reducing the cost of detecting entities:

Detecting entities in real-time can be computationally expensive, especially for large-scale Twitter data streams. To reduce the cost of detecting entities, consider the following strategies:

  1. Entity extraction optimization: Use efficient named entity recognition (NER) tools or libraries that can handle streaming data, like spaCy or the Stanford NER. These libraries are optimized for performance and can handle large-scale text data efficiently.
  2. Filter irrelevant data: Preprocess the Twitter data stream to remove irrelevant information, such as URLs, hashtags, user mentions, and stop words. This reduces the volume of data to process and allows the entity extraction to focus on relevant content.
  3. Keyword-based entity detection: Instead of using full-fledged NER models, you can create a list of relevant keywords or entities specific to the domain of interest. This can help in detecting events of interest with lower computational cost.
  4. Incremental entity extraction: Instead of processing the entire data stream at once, perform entity extraction incrementally as new tweets arrive. This can help distribute the computational load over time, making it more manageable.
  5. Parallelization: Utilize parallel processing techniques to distribute the entity extraction task across multiple cores or machines, which can significantly speed up the process and reduce the overall computational cost.

By incorporating these strategies, you can reduce the cost of detecting entities in Twitter data streams while still effectively identifying and tracking events.
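As a minimal illustration of the entity-extraction step, a spaCy sketch (assuming the small English model has been installed; batching with `nlp.pipe` is what keeps the per-tweet cost manageable on a stream, and it could be combined with the keyword-based pre-filtering above so that NER only runs on candidate tweets):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

tweets = [
    "Flooding reported in Glasgow city centre, says the Met Office",
    "Celtic beat Rangers 2-1 at Hampden Park",
]

# nlp.pipe processes tweets in batches, which is much cheaper than calling nlp() per tweet;
# disabling unused pipeline components (e.g., the parser) would reduce the cost further.
for doc in nlp.pipe(tweets, batch_size=256):
    print([(ent.text, ent.label_) for ent in doc.ents])
```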

(ii) If you were to use tf-idf concepts for representation, how would you capture them?

To capture TF-IDF concepts for representation in the context of detecting events from Twitter data, follow these steps:

  1. Data preprocessing: Preprocess the tweets by removing irrelevant information (e.g., URLs, hashtags, user mentions), tokenizing the text, removing stop words, converting all text to lowercase, and performing lemmatization or stemming.

  2. Term Frequency (TF): Calculate the term frequency for each term in each tweet. The term frequency is the number of times a term appears in a tweet divided by the total number of terms in that tweet. This step normalizes the frequency of terms in each tweet, accounting for varying tweet lengths.

  3. Inverse Document Frequency (IDF): Calculate the inverse document frequency for each term across the entire Twitter data stream. The IDF measures the importance of a term by considering its rarity in the entire dataset. To calculate IDF, first compute the document frequency (DF) – the number of tweets containing a particular term. Then, compute the IDF as the logarithm of the total number of tweets divided by the DF.

  4. TF-IDF representation: Compute the TF-IDF score for each term in each tweet by multiplying the TF and IDF values. This score represents the importance of a term in a tweet while considering its rarity in the entire dataset.

  5. Feature vectors: For each tweet, create a feature vector with the TF-IDF scores of its terms. This can be represented as a sparse vector where each dimension corresponds to a unique term from the entire vocabulary across all tweets, and its value is the TF-IDF score for that term in the specific tweet. If a term is not present in a tweet, its value in the vector will be zero.

  6. Computing TF (term frequency): for a given document d and term t, the TF value is calculated as:

    TF(t, d) = (number of occurrences of t in d) / (total number of terms in d)

    The numerator is the count of term t in document d and the denominator is the total number of terms in d. TF reflects how important a term is within the current document: the more often it occurs, the larger its TF value.

  7. Computing IDF (inverse document frequency): over all documents in the corpus, the IDF value of term t is calculated as:

    IDF(t) = log((total number of documents in the corpus) / (number of documents containing t + 1))

    The numerator is the total number of documents and the denominator is the number of documents containing t (adding 1 avoids division by zero). IDF reflects how important a term is across the whole corpus: the fewer documents it appears in, the larger its IDF value.

  8. Computing TF-IDF: multiply TF and IDF to obtain the TF-IDF value of term t in document d:

    TF-IDF(t, d) = TF(t, d) * IDF(t)

    The TF-IDF value reflects the importance of term t in document d while also accounting for how the term is distributed over the whole corpus. The TF value can additionally be smoothed, for example:

    TF(t, d) = 0.5 + 0.5 * (number of occurrences of t in d) / (total number of terms in d)

    This smoothing prevents any single term in a long document from receiving an excessively high TF value.
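A minimal pure-Python sketch of the TF-IDF computation described above (including the +1 in the IDF denominator); the tiny corpus is invented for illustration:

```python
import math

def tf(term, doc_tokens):
    # term frequency: occurrences of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # inverse document frequency, with +1 in the denominator to avoid division by zero
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["flood", "warning", "glasgow", "rain"],
    ["rain", "forecast", "edinburgh"],
    ["glasgow", "derby", "football"],
]

for term in ["glasgow", "rain", "football"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In practice a library implementation such as scikit-learn’s `TfidfVectorizer` would usually be used instead, though note that its default IDF formula differs slightly from the one above.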

(iii) How do you remove noisy or spam groups?

Removing noisy or spam groups from a dataset involves filtering out irrelevant or low-quality content. Here are some strategies to identify and remove such groups:

  1. Text-based filtering: Analyze the content of the groups and apply filters to eliminate groups containing certain keywords or patterns that are commonly associated with spam or noise. For example, you can create a list of common spam keywords, phrases, or patterns, and remove groups that contain a high frequency of these terms.
  2. Frequency-based filtering: Analyze the posting frequency of the groups. Spam or noisy groups often exhibit unusual posting patterns, such as posting the same content repeatedly, or posting at extremely high frequencies. Set a threshold for acceptable posting frequency and filter out groups that exceed this limit.
  3. User-based filtering: Analyze the users contributing to the groups. If a group consists mostly of users with suspicious behavior or characteristics (e.g., newly created accounts, accounts with very few followers or following a large number of users), it might be a spam or noisy group. You can create a scoring system to rate the credibility of users and filter out groups with a high proportion of low-credibility users.
  4. Group size: Small groups or groups with very few members might be more likely to be noisy or spammy. You can set a minimum group size threshold and remove groups that fall below this limit.
  5. Language-based filtering: Analyze the language used in the groups. Spam or noisy groups may contain a high proportion of irrelevant or nonsensical text, or text in a language that is not of interest for your analysis. Use natural language processing techniques, such as language detection, to filter out groups with content in unwanted languages or with a high proportion of unintelligible text.
  6. Machine learning techniques: Train a machine learning model to classify groups as spam, noisy, or relevant based on features like text content, posting frequency, user characteristics, group size, and language. This approach can be more adaptive and effective in identifying spam and noisy groups, especially if the model is regularly updated with new data.

By applying these strategies, you can identify and remove noisy or spam groups from your dataset, allowing for more accurate and meaningful analysis of the remaining content.
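A minimal sketch combining the keyword-based and frequency-based filters above; the spam keyword list and thresholds are illustrative assumptions:

```python
SPAM_KEYWORDS = {"free", "giveaway", "click", "promo", "follow back"}   # illustrative

def is_spam_group(tweets, keyword_ratio=0.3, duplicate_ratio=0.5, min_size=3):
    """Flag a group (list of tweet texts) as spam/noise using simple heuristics."""
    if len(tweets) < min_size:                        # very small groups treated as noise
        return True
    lowered = [t.lower() for t in tweets]
    spammy = sum(any(k in t for k in SPAM_KEYWORDS) for t in lowered)
    if spammy / len(tweets) >= keyword_ratio:         # keyword-based filtering
        return True
    unique = len(set(lowered))
    if 1 - unique / len(tweets) >= duplicate_ratio:   # near-identical reposting pattern
        return True
    return False

groups = {
    "g1": ["FREE iPhone giveaway click here", "FREE giveaway click now", "FREE giveaway click now"],
    "g2": ["Flooding reported in Glasgow", "Glasgow streets flooded after heavy rain",
           "Heavy rain causes flooding across Glasgow"],
}
clean_groups = {gid: g for gid, g in groups.items() if not is_spam_group(g)}
print(list(clean_groups))   # expected: ['g2']
```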

(iv) How would you identify categories of events?

1. Visualization: plot the top 20 most frequent words in each group (e.g., as a bar chart or word cloud); this helps us understand the topic of a group and categorize it. Alternatively, use PCA to reduce the dimensionality of the group representations and project them into a lower-dimensional space to find overlapping groups.

2. Cluster labeling: assign descriptive labels to the groups based on the representative information extracted in the previous step. You can manually analyze the most frequent terms, key phrases, or named entities in each group and assign a suitable category label. Alternatively, you can use an automated approach, such as extracting the most frequent terms or key phrases as labels (a short sketch follows).
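A minimal sketch of the automated labeling option, taking the most frequent non-stop-word terms of each group as its label; the stop-word list is illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "in", "and", "a", "of", "to", "is", "on", "for", "after", "across"}  # illustrative

def label_group(tweets, top_n=3):
    counts = Counter(
        token
        for tweet in tweets
        for token in tweet.lower().split()
        if token not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(top_n)]

group = ["Flooding reported in Glasgow", "Glasgow streets flooded after heavy rain",
         "Heavy rain causes flooding across Glasgow"]
print(label_group(group))   # top terms, e.g. ['glasgow', 'flooding', ...]
```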

(v) Provide an algorithm for combining similar groups of tweets (e.g., tweets containing same entities).

  1. Extract entities from each group:
    • For each group, extract the named entities (e.g., people, organizations, locations) from the text of the tweets and articles.
    • You can use a Named Entity Recognition (NER) library like spaCy or Stanford NER to perform this task.
  2. Calculate entity similarity between groups:
    • Create a function to compute the similarity between two groups based on the shared entities.
    • You can use the Jaccard similarity coefficient, which is the size of the intersection of entities divided by the size of the union of entities for each pair of groups.
  3. Combine similar groups based on a similarity threshold:
    • Set a similarity threshold (e.g., 0.5) to decide whether two groups should be combined.
    • For each pair of groups:
      • Compute the entity similarity using the function created in step 2.
      • If the similarity score is greater than or equal to the threshold:
        • Combine the two groups into one.
        • Update the entity set of the combined group.
        • Remove the original groups from the list of groups.
    • Repeat this process until no more groups can be combined based on the similarity threshold.
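A minimal sketch of this merging procedure, assuming each group has already been reduced to its set of extracted entities (the entity values below are invented):

```python
def jaccard(a, b):
    # size of the intersection divided by the size of the union of the two entity sets
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_similar_groups(entity_sets, threshold=0.5):
    """Repeatedly merge any pair of groups whose entity sets have Jaccard similarity >= threshold."""
    groups = [set(s) for s in entity_sets]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if jaccard(groups[i], groups[j]) >= threshold:
                    groups[i] |= groups[j]   # combine the two groups' entities
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups

entity_sets = [{"Glasgow", "Met Office"}, {"Glasgow", "Met Office", "SEPA"}, {"Celtic", "Rangers"}]
print(merge_similar_groups(entity_sets))   # first two groups merge; the football group stays separate
```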

Here are some features other than entities that you can use to combine similar groups:

  1. Term Frequency (TF):
    • Use the frequency of terms within each group as a feature. Calculate the similarity between groups based on the overlap of their most frequent terms.
  2. Key phrases:
    • Extract key phrases from the text in each group using techniques like RAKE (Rapid Automatic Keyword Extraction) or TextRank. Compare the groups based on the overlap of their key phrases.
  3. Sentiment analysis:
    • Calculate the average sentiment score of each group using a sentiment analysis library or pre-trained model. Combine groups with similar sentiment scores.
  4. Topic modeling:
    • Apply topic modeling techniques like Latent Dirichlet Allocation (LDA) to the dataset. Compare groups based on the distribution of topics within each group.
  5. Word embeddings:
    • Use pre-trained word embeddings like Word2Vec, GloVe, or BERT to represent the text within each group. Calculate the average embedding for each group and compare groups using cosine similarity or other distance metrics (see the sketch after this list).
  6. N-grams:
    • Extract N-grams (sequences of N consecutive words) from the text within each group. Compare groups based on the overlap of their most frequent N-grams.
  7. Text similarity:
    • Calculate the average pairwise text similarity within each group using a text similarity measure like cosine similarity, Jaccard similarity, or edit distance. Combine groups with similar average text similarity scores.
  8. Hashtags and user mentions:
    • For Twitter data, extract hashtags and user mentions from the tweets within each group. Compare groups based on the overlap of their most frequent hashtags and user mentions.
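For instance, the word-embedding feature above could be sketched as follows; the toy embedding table is invented purely for illustration, and in practice you would load pre-trained Word2Vec or GloVe vectors:

```python
import numpy as np

# Toy 3-dimensional embeddings, for illustration only
TOY_EMBEDDINGS = {
    "flood":    np.array([0.9, 0.1, 0.0]),
    "rain":     np.array([0.8, 0.2, 0.1]),
    "football": np.array([0.0, 0.9, 0.3]),
    "match":    np.array([0.1, 0.8, 0.4]),
}

def group_vector(tweets, dim=3):
    """Average the embeddings of all known tokens in a group of tweets."""
    vectors = [TOY_EMBEDDINGS[tok] for tweet in tweets for tok in tweet.lower().split()
               if tok in TOY_EMBEDDINGS]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

g1 = ["flood warning after heavy rain"]
g2 = ["rain causes flood in the city"]
g3 = ["football match tonight"]
print(cosine(group_vector(g1), group_vector(g2)))   # high: similar topic, candidates for merging
print(cosine(group_vector(g1), group_vector(g3)))   # low: different topic
```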

(vi) How do we find the bursting clusters?

To find bursting clusters, you can use a sliding window algorithm combined with a clustering method. The idea is to divide the data into time windows and apply clustering within each window to identify groups of similar items. By comparing the clusters across different time windows, you can detect bursts of activity. Here’s an outline of the algorithm:

  1. Divide the data into time windows:
    • Choose an appropriate window size and step size based on the dataset and the expected duration of bursts.
    • Divide the data into non-overlapping or overlapping time windows accordingly.
  2. Apply clustering within each time window:
    • For each time window, preprocess the data (e.g., tokenize, remove stop words, stemming/lemmatization) and create feature vectors using methods like TF-IDF or word embeddings.
    • Apply a clustering algorithm (e.g., K-means, DBSCAN) to the feature vectors within the window to group similar items.
  3. Detect bursts by comparing clusters across adjacent time windows:
    • Define a burst detection criterion, such as a significant increase in the number of items within a cluster or the emergence of a new cluster with a large number of items.
    • For each pair of adjacent time windows, compare the clusters and identify those that meet the burst detection criterion.

Here’s a pseudo-code for the algorithm:

    def preprocess_and_vectorize(data):
        preprocessed_data = preprocess_text(data)                     # tokenize, remove stop words, stem/lemmatize
        feature_vectors = create_feature_vectors(preprocessed_data)   # e.g. TF-IDF or word embeddings
        return feature_vectors

    def cluster_data(feature_vectors):
        clusters = apply_clustering_algorithm(feature_vectors)        # e.g. K-means or DBSCAN
        return clusters

    def detect_bursts(window_clusters, burst_threshold):
        bursts = []
        for i in range(len(window_clusters) - 1):
            for cluster in window_clusters[i]:
                # Match the cluster against the clusters of the next time window
                next_window_cluster = find_corresponding_cluster(cluster, window_clusters[i + 1])
                if not next_window_cluster:
                    continue
                growth = len(next_window_cluster) - len(cluster)
                if growth >= burst_threshold:                         # burst = large growth between windows
                    bursts.append((cluster, next_window_cluster))
        return bursts

    data = load_data()
    window_size = ...
    step_size = ...
    burst_threshold = ...

    time_windows = create_time_windows(data, window_size, step_size)
    window_clusters = []

    for window in time_windows:
        feature_vectors = preprocess_and_vectorize(window)
        clusters = cluster_data(feature_vectors)
        window_clusters.append(clusters)

    bursts = detect_bursts(window_clusters, burst_threshold)

This algorithm divides the data into time windows, applies clustering within each window, and detects bursts based on changes in cluster sizes across adjacent windows. You can customize the window size, step size, clustering algorithm, and burst detection criterion based on your specific dataset and use case.

(b) Describe a methodology to predict stock trends from social media data

StockNet (Deep Learning method)

StockNet (Xu and Cohen, 2018) is a deep learning model that predicts stock price movement jointly from tweets mentioning a stock and its historical prices. More generally, stock trends can be predicted with deep learning models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or convolutional neural networks (CNNs) trained on a combination of market and social media signals.

To use a deep learning model like “StockNet” to predict stock trends, follow these general steps:

  1. Data collection: Gather historical stock data, such as stock prices, trading volumes, and other relevant financial indicators, together with the related social media posts (e.g., tweets mentioning the target stocks). You may obtain this data from financial data providers, public financial statements, social media APIs, or web scraping.
  2. Data preprocessing: Clean and preprocess the data to eliminate noise, handle missing values, and convert categorical data into numerical formats. This step may also involve feature engineering, for example deriving daily sentiment scores from the collected tweets.
  3. Feature scaling: Scale or normalize the features to ensure that they have similar ranges and are suitable for input into a deep learning model.
  4. Train-test split: Split the dataset into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance.
  5. Model selection: Choose an appropriate deep learning model based on your data and problem. Common choices include RNNs, LSTMs, and CNNs.
  6. Model training: Train the selected model on the training data. This involves adjusting the model’s parameters to minimize the prediction error. You may need to experiment with different hyperparameters, such as learning rate, batch size, and the number of hidden layers.
  7. Model evaluation: Evaluate the model’s performance on the testing data. Common evaluation metrics include mean squared error (MSE), mean absolute error (MAE), and R-squared.
  8. Fine-tuning: If the model’s performance is unsatisfactory, fine-tune it by adjusting the hyperparameters, modifying the model architecture, or changing the features used.
  9. Prediction: Once the model has been fine-tuned and performs well on the testing data, use it to predict future stock trends based on the input features.

Keep in mind that predicting stock trends is inherently difficult due to the complex and dynamic nature of financial markets. No model can guarantee accurate predictions, and it is essential to manage risks and avoid relying solely on model predictions for making investment decisions.
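As a rough, non-authoritative sketch of steps 4-7 (not StockNet’s actual architecture), the following assumes each training example is a 10-day window of daily features, one of which is an aggregated tweet-sentiment score; all data and hyperparameters here are placeholders:

```python
import numpy as np
import tensorflow as tf

window, n_features = 10, 4   # 10 past days; e.g. daily return, volume, mean tweet sentiment, tweet volume
X = np.random.rand(500, window, n_features).astype("float32")   # placeholder data, for illustration only
y = (np.random.rand(500) > 0.5).astype("float32")               # 1 = price moves up the next day

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, n_features)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of an upward movement
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.1, verbose=0)

loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test accuracy: {acc:.2f}")   # meaningless on random placeholder data; shown only to illustrate the workflow
```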
