Techniques for Twitter Event Detection

With globalisation and digitalisation, the exponential growth of Twitter has drawn the attention of researchers. A significant amount of work has been done on event detection in Twitter using a number of techniques. Most studies use machine learning approaches, either supervised or unsupervised, while others apply hybrid techniques. In addition, linguistic and statistical approaches have been combined with machine learning in a number of studies.

This section presents various studies on event detection in Twitter, grouped by the following approaches:

Supervised Detection Approaches

Unsupervised Detection Approaches

Hybrid Detection Approaches

In addition to the above approaches, graph-based detection approaches are also discussed.

The aim of this study is to propose a novel method for classifying Twitter trend topics, using URLs, retweets, and influential users (tweets posted by influential users in recent trends) as the main classification features. Most Twitter studies focus mainly on the text content of tweets and usually filter out features such as URLs and retweets; this work instead uses that metadata to increase classification accuracy. The data was collected from trend topic websites and covers topics such as Education, Cinema, Sports, Arts, and Politics. The selected tweets include tweets containing a URL, tweets containing a single trend topic or multiple trend topics, and tweets with neither a URL nor a trend topic. The tweets were pre-processed by removing non-English tweets, stop words, special symbols, and usernames before being passed to the algorithm. The algorithm tests each tweet against the features mentioned above. If a tweet has a URL, the website behind that link is classified and the tweet receives the same class as the website. If the tweet has a trending topic, the words from the top five retweets of that trend and the top five influential words are collected and then classified. If the tweet has neither a URL nor a trend topic, traditional classifiers are used to classify it.
The proposed method was built using WEKA (open-source data mining software from the University of Waikato) and compared with traditional text classification techniques such as Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbour (KNN), and Bag of Words (BOW). The results showed that the method outperformed these conventional tweet classification techniques.
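
The three-way decision described above (URL first, then trend topic, then a plain text classifier) can be sketched as follows. This is only an illustration under stated assumptions: the lookup tables, the example domain, and the toy keyword classifier are hypothetical stand-ins and are not taken from the study, which used WEKA classifiers.

```python
from collections import Counter

# Hypothetical keyword lists standing in for a trained text classifier.
KEYWORDS = {
    "Sports": {"match", "goal", "league"},
    "Politics": {"election", "vote", "minister"},
}


def keyword_classifier(words):
    """Toy fallback classifier: the topic whose keywords occur most often wins."""
    votes = Counter()
    for word in words:
        for topic, vocabulary in KEYWORDS.items():
            if word.lower() in vocabulary:
                votes[topic] += 1
    return votes.most_common(1)[0][0] if votes else "Unknown"


def classify_tweet(tweet, site_categories, trend_words, text_classifier):
    """Classify one pre-processed tweet.

    tweet: dict with optional keys "url_domain" and "trend", plus "text".
    site_categories: assumed lookup from a website domain to its topic.
    trend_words: assumed lookup from a trend to the words gathered from its
        top retweets and influential users.
    """
    # 1. A tweet with a URL inherits the class of the linked website.
    domain = tweet.get("url_domain")
    if domain in site_categories:
        return site_categories[domain]
    # 2. A tweet with a trend topic is classified from the trend's words.
    trend = tweet.get("trend")
    if trend in trend_words:
        return text_classifier(trend_words[trend])
    # 3. Otherwise fall back to classifying the tweet text itself.
    return text_classifier(tweet["text"].split())
```

A tweet whose URL domain is missing from the lookup simply falls through to the next rule, which is a design choice of this sketch rather than something the study specifies.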

The authors present STED, a semi-supervised target event detection application for detecting events from Twitter. Their algorithm uses newspapers as an external source to label the data automatically: the system extracts labels from the newspaper data and then transfers them to tweets.

The HOTSTREAM application was used to detect and track news on Twitter. It focuses only on news, so tweets containing the hashtag #breakingnews or the keywords 'breaking news' were collected using the API. A content-based index was built with Apache Lucene, a Java indexing and search library, and this index was used to group the tweets. Tweets were grouped based on term frequency (TF) and inverse document frequency (IDF), as well as the content similarity between tweets. Proper nouns, identified using the Stanford Named Entity Recognizer (NER), and hashtags were given a higher similarity score, so their weights were increased. A tweet joins a cluster if it is similar to the first story (the first message in the group) as well as to the top K terms in the cluster (the authors selected K=10). A merge threshold (Mergethre) was used to assign a new message to a cluster. Each group is scored on two main factors: reliability (number of followers) and popularity (number of retweets). Because the study aims to discover the latest news, it was necessary to evaluate whether a group contains new events; this was determined from the time difference between the last message and the current time. The second stage of the algorithm is story development, in which more information about the events is provided from official external sources.
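
The grouping step above can be illustrated with a small sketch: TF-IDF vectors in which hashtags and proper nouns receive a boosted weight, and a merge threshold that decides whether a message joins an existing group. The boost factor, the threshold value, and the function names are assumptions made for illustration; the sketch compares a message only with the first message of each group and omits the top-K-terms check.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs, proper_nouns, boost=2.0):
    """Sparse TF-IDF vector; hashtags and proper nouns get an assumed boost."""
    vector = {}
    for term, tf in Counter(tokens).items():
        # Smoothed IDF so that unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        weight = tf * idf
        if term.startswith("#") or term in proper_nouns:
            weight *= boost
        vector[term] = weight
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(message, clusters, merge_threshold):
    """Add the message to the most similar group, or start a new group."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = cosine(message, cluster["first"])
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= merge_threshold:
        best["members"].append(message)
        return best
    new_cluster = {"first": message, "members": [message]}
    clusters.append(new_cluster)
    return new_cluster
```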

The purpose of this work was to detect, track, and summarise events on microblogs and online streams. It introduces topical words as a novel feature for event detection. Topical words are the words most related to an event, in other words the most popular terms concerning the event, and they are determined by their frequencies in the corpus. The algorithm begins by extracting these words from the stream; divisive hierarchical clustering is then applied to a co-occurrence graph of the topical words, which represents their connections to each other, and the resulting clusters form the events. To track how events change over time, maximum weighted bipartite matching is used: the nodes with maximum weight in adjacent time intervals are matched, and related events are gathered into event chains. Cosine similarity, combined with a time interval, is used to determine the most relevant posts for summarising the events. The summaries are then associated with the event chain clusters to create sequences of event types.

This paper addresses how to detect events using the wavelet technique. The EDCoW (Event Detection with Clustering of Wavelet-based Signals) application uses the discrete wavelet transformation to detect events. “Wavelet analysis is a well-known signal processing method to detect changes and peaks in signals”. Signals are therefore built for individual terms from Twitter as a major step in the detection. This work has three main phases:

First, a signal is built for every individual word from its frequencies by applying the wavelet transformation, and auto-correlation (the correlation of a signal with a delayed copy of itself, as a function of the delay) is used to discover bursty words. Only words with high signal auto-correlation are selected; as a result, trivial words are removed.
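
As a rough illustration of the auto-correlation filter in this first phase, the sketch below computes the lag-1 auto-correlation of a word's frequency signal and keeps only the words above a threshold. It works on raw frequency counts and skips the wavelet transformation step; the threshold value and the function names are assumptions, not values from the paper.

```python
def autocorrelation(signal, lag=1):
    """Normalised auto-correlation of a signal with itself shifted by `lag`."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal)
    if variance == 0 or lag >= n:
        return 0.0  # a flat signal carries no burst information
    covariance = sum(
        (signal[i] - mean) * (signal[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def select_bursty_words(signals, threshold):
    """Keep the words whose lag-1 auto-correlation exceeds the threshold."""
    return [word for word, sig in signals.items() if autocorrelation(sig) > threshold]
```

A word that is used in a sustained burst has neighbouring intervals with similar counts and therefore a high auto-correlation, while a word whose counts jump up and down from one interval to the next scores low or negative.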

The second phase calculates the similarity between signals using cross-correlation, a well-known measure of signal similarity. Finally, modularity-based clustering groups similar wavelet signals together to create the events. Each cluster represents an event and consists of a group of words with high cross-correlation, while the cross-correlation between words in different clusters is expected to be low. On this basis, the number of words and the cross-correlation among words are used to identify significant events and discard the others. The authors claimed that the precision of EDCoW in detecting events was 76%.

Wavelet transformation is divided into continuous wavelet transformation (CWT) and discrete wavelet transformation (DWT). The previous study used the DWT, while this study uses the CWT to detect events in the Twitter stream by identifying peaks in hashtag signals built from hashtag frequencies. Instead of building discrete wavelet signals from single terms, continuous wavelet signals of hashtags are used, and Latent Dirichlet Allocation (LDA) is then applied to summarise and explain the events. The tweets were obtained using the cURL tool and stored in a MongoDB database for processing. The hashtag signals are constructed using two map-reduce transformations. The first extracts the hashtags from the whole stream, which is split into five-minute intervals, and produces lists of hashtags and their tweets. The second builds the signals for all distinct hashtags based on their frequencies.
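
The two passes that build the hashtag signals can be sketched in plain Python as below. This is a single-machine illustration, not the authors' map-reduce jobs: the input format (pairs of a Unix timestamp and the tweet text), the function names, and the choice to record interval indices rather than whole tweets are assumptions.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")
INTERVAL_SECONDS = 300  # five-minute intervals, as in the text


def map_hashtags(tweets, start):
    """First pass: record the interval index of every hashtag occurrence."""
    by_tag = defaultdict(list)
    for timestamp, text in tweets:
        bucket = int((timestamp - start) // INTERVAL_SECONDS)
        for tag in HASHTAG.findall(text.lower()):
            by_tag[tag].append(bucket)
    return by_tag


def build_signals(by_tag, n_buckets):
    """Second pass: turn the occurrences into one frequency signal per hashtag."""
    signals = {}
    for tag, buckets in by_tag.items():
        signal = [0] * n_buckets
        for bucket in buckets:
            if 0 <= bucket < n_buckets:
                signal[bucket] += 1
        signals[tag] = signal
    return signals
```

Each resulting list is the frequency signal of one hashtag, with one entry per five-minute interval.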
These signals are then analysed with wavelet transformation to detect events, using two wavelet tools. The first is peak analysis, which detects peaks in the hashtag signals, in other words sudden increases in hashtag frequency. The second is local maximum detection, which detects any change in the hashtag signals. Before this process starts, the Kolmogorov-Zurbenko adaptive filter (KZA) is used to smooth the signals by removing noise, so that peak detection and change monitoring can be done more accurately. After the event detection stage, the researchers describe and summarise the events using LDA, with Gibbs sampling to estimate the topics; this yields five topics per five-minute interval for every hashtag. The technique was assessed by visualising the results obtained from an eight-day dataset of 13.6 million tweets.

The TwitterStand application was developed as an automatic breaking news detection system and concentrates on tweets related to the latest news. The tweets are gathered from Twitter through different types of input, such as Seeders, GardenHose, BirdDog, Artifacts, and Track (Seeders are considered formal and reliable sources that publish news from newspapers, television stations, etc.). Incoming tweets, except those from Seeders, are marked as news or junk, and a Naïve Bayes classifier, based on Bayes' theorem, is used to distinguish between the two. Once the tweets have been labelled, the junk tweets are filtered out and the news tweets are clustered using the leader-follower clustering algorithm, based on TF-IDF and cosine similarity.

The application can determine the locations of news from these clusters, according to the contents of the tweets in each cluster, through a geotagging process with two steps: toponym recognition, which finds the terms that may refer to geographic locations, and toponym resolution, which determines the correct location for each toponym. Metadata containing the user's location was also considered when determining the location.
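
The news/junk filtering step in TwitterStand can be illustrated with a small Naïve Bayes classifier. The sketch assumes a multinomial model with add-one smoothing and a hand-labelled training set; the text only states that Naïve Bayes was used, so these details and the class name are illustrative.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Minimal multinomial Naive Bayes over tokenised tweets."""

    def __init__(self):
        self.class_counts = Counter()            # tweets seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(word | label), add-one smoothed
            score = math.log(count / total)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.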

A hybrid method is another effective way to distinguish real-world events from non-events in the Twitter stream. The authors use threshold-based clustering to cluster the tweets and form the events. To create these clusters they consider a number of features, namely temporal features (tweets that appear at a specific time), social features (user interactions such as retweets, replies, and mentions), topical features, and Twitter-centric features, together with a cosine similarity metric and cluster centroids. A classification model is then applied to determine whether the clusters represent real-world events; the authors use their own classifier, RW-Event, built with Weka on the support vector machine algorithm.

Both of the previous studies used hybrid techniques. However, this study is distinguished from the earlier one by its use of a number of features to ensure the quality of the classes created, and by its novel classification method.

This study aims to summarise tweets using a graph-based approach. The tweets were collected using the API, and the authors used tweets related to the floods in Uttaranchal, India. Similarity between tweets was measured at the term level and the semantic level. At the term level, the application looks for tweets with similar content, and the frequencies of usernames, URL links, and hashtags are calculated. In addition, cosine distance and Levenshtein distance (the least number of edit operations needed to convert one sentence into another) are used to measure term-level similarity. Semantic similarity between tweets was measured using WordNet. The similarity score between two tweets is the sum of the term-level and semantic-level similarities. A tweet similarity graph (TSG) was then formed, in which the nodes represent the collected tweets and the edge weights between pairs of tweets are the similarity scores calculated in the previous phase. Community detection was applied to this graph using the Infomap algorithm to find tweets that are similar to each other, dividing the graph into communities of similar tweets. However, some tweets may belong to more than one community. Consequently, there are links (edges) between different communities, known as intercommunity edges, while intracommunity edges are the links within a community. The application also provides summaries to the users.

Different types of summaries were proposed. Total summary: one tweet is selected from each community to represent its group, and the final summary is assembled across all groups. Total degree summary: the tweets with the highest weights are chosen to represent the community. Total length summary: the longest tweet in the community is selected, on the grounds that it contains the most information and is therefore a good representative of the community. The last phase was the evaluation of the algorithms' performance. The authors compared their summarisation methods with SumBasic, a standard algorithm used to summarise long documents that depends mainly on word frequencies. They used human volunteers to obtain reference summaries and then used the standard metrics precision (P), recall (R), and F-measure (F) to compare their algorithms' output with the human summaries. The results showed that the proposed summary methods outperformed SumBasic.
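
Two of the building blocks above are easy to show in code: the Levenshtein distance used for term-level similarity, and the length-based summary that picks the longest tweet of each community. The sketch is illustrative; the function names are assumptions, and the communities are taken as given rather than computed with Infomap.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any two sequences, so it can compare strings character by
    character or tokenised tweets word by word.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(
                min(
                    previous[j] + 1,                        # deletion
                    current[j - 1] + 1,                     # insertion
                    previous[j - 1] + (item_a != item_b),   # substitution
                )
            )
        previous = current
    return previous[-1]


def length_summary(communities):
    """Pick the longest tweet of each community as its representative."""
    return [max(community, key=len) for community in communities]
```

For example, `levenshtein("kitten", "sitting")` is 3, and two tokenised tweets that differ in a single word are at distance 1.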

The primary objective of this work was to discover the onset of events on Twitter online; the authors claimed that they could detect the onset of an event within 3-8 minutes after the conversation about it had started. Their application, Event Detection Onset (EDO), has five phases: collecting and tokenising tweets, graph generation, graph pruning, clustering, and event evaluation. In the first stage, tweets are collected online using the API within a short time window (one minute) and, after cleaning, tokenised. A weighted graph is constructed from these tokens: the tokens are its nodes, and the weight on each edge is the number of tweets in which the two terms appear together. The resulting graph is dense, so its size has to be reduced, which is the purpose of the graph pruning phase. Pruning starts by determining the important and emerging words using Kullback-Leibler (KL) divergence. The authors define important words as words that appear frequently in tweets, while emerging words are words that are suddenly used heavily but did not appear in the past. Nodes and edges are then deleted as follows: all terms in the graph that are neither important nor emerging are deleted, edges are deleted based on their weight, and all important words that are not in the heap of emerging words are deleted. Afterward, a voltage-based clustering algorithm is applied to the pruned graph to determine the events, following a physics approach based on electric circuits. The graph is transformed into an electric circuit: each edge is treated as a resistor, and two nodes are randomly selected to act as the positive and negative poles. The event clusters are identified from the voltage value of each node, obtained using Kirchhoff's equations; the values lie between 0 and 1.
Each node is then assigned to the community to which it belongs. Finally, the events are evaluated to determine whether they are credible. Events are considered credible when they appear within a specific time; a period of three days was set, and the authors used their model to determine which events would appear during this period.
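
The graph generation and voltage steps of this pipeline can be sketched as follows. The sketch makes several assumptions that the text does not confirm: edge weights are treated as conductances (a heavier edge is a weaker resistor), the two poles are passed in explicitly instead of being chosen at random, and Kirchhoff's equations are solved approximately by repeatedly setting each node to the weighted average of its neighbours.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(tokenised_tweets):
    """Edge weight = number of tweets in which the two terms appear together."""
    weights = Counter()
    for tokens in tokenised_tweets:
        for a, b in combinations(sorted(set(tokens)), 2):
            weights[(a, b)] += 1
    return weights


def voltages(weights, source, sink, sweeps=100):
    """Approximate node voltages with the source fixed at 1 and the sink at 0."""
    neighbours = {}
    for (a, b), w in weights.items():
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w
    volt = {node: 0.0 for node in neighbours}
    volt[source] = 1.0
    for _ in range(sweeps):
        for node, adjacent in neighbours.items():
            if node in (source, sink):
                continue  # the poles keep their fixed voltages
            total = sum(adjacent.values())
            volt[node] = sum(w * volt[n] for n, w in adjacent.items()) / total
    return volt
```

Nodes whose voltages end up close to 1 can then be grouped with the positive pole and those close to 0 with the negative pole, which is how the voltage values separate the graph into communities.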

This article aims to detect emerging unspecified events in the Twitter stream online. Collecting tweets was the first step; because the authors focused on emerging events and were looking for tweets that are heavily retweeted among Twitter users, only retweets were selected. The data was pre-processed by removing stop words and URLs and by applying word segmentation and lemmatisation (grouping the various inflected forms of a word into a single unit). An undirected weighted graph is then constructed every 30 minutes, with retweets as vertices and the similarity scores between them, calculated using TF-IDF, as the edge weights. To detect the events, the Markov Cluster algorithm (MCL) is applied to the graph. MCL is based on random walks: it starts at any vertex in the graph and walks randomly to form the clusters. The algorithm assumes that the clusters already exist in the graph. When a walk continues for a long time, it is taken to be inside a cluster, where similar objects are grouped together, and the cluster is formed when the walk stops at its starting point; short walks are taken to be outside a cluster. The significance of the events is determined using two factors: the degree of relatedness (which determines whether the event is concurrent) and the size of the event. Event size was chosen because an event that generates a lot of discussion is likely to be important. The last stage identifies the emerging events, relying on trending events to detect them. Since emerging events are events that are heavily discussed at present without any relation to the past, they can be detected from trending events, where they occur at the beginning of the trend.
The results showed that the proposed approach can detect emerging events; the results were surveyed and confirmed by fifteen examiners, with 70-80% precision for event detection.
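
For readers unfamiliar with MCL, the sketch below shows the usual matrix formulation of the algorithm: the random walk is simulated by squaring a column-stochastic matrix (expansion), and strong transitions are reinforced by raising entries to a power (inflation). It is a textbook-style illustration on a dense adjacency matrix, not the authors' implementation; the inflation value, the iteration count, and the self-loops are assumed defaults.

```python
def normalise_columns(m):
    """Scale every column so that it sums to 1 (transition probabilities)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(m, power):
    """Inflation: raise each entry to a power to favour strong transitions."""
    return [[value ** power for value in row] for row in m]


def markov_clusters(adjacency, inflation=2.0, iterations=50):
    """Cluster a weighted undirected graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Self-loops keep every column non-empty and stabilise the iteration.
    m = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalise_columns(m)
    for _ in range(iterations):
        m = normalise_columns(inflate(expand(m), inflation))
    clusters = []
    for row in m:
        # A non-empty row belongs to an attractor; its columns form a cluster.
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters
```

In the setting described above, the adjacency matrix would hold the TF-IDF similarity scores between retweets, and each returned set of indices would correspond to one candidate event.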
