Twitter’s Impact on Trending Topics

1. Introduction

Twitter is one of the most popular microblogging websites, where individuals can share their thoughts and views on any topic, from breaking news to the activities surrounding them. These messages are commonly known as Tweets and can be viewed worldwide; people who want to see a particular user's posts can follow that user to view their timeline. Twitter thus provides real-time information which is disseminated worldwide. The popularity of Twitter is increasing dramatically day by day; a report by Lee et al. states that '200 million tweets are generated every day' (Lee, et al., 2011). When a new topic posted by individuals becomes popular across social platforms, it is known as a trending topic. These trending topics are discussed by millions of people around the world and include hashtags, names, locations, different languages and much more. Due to this complexity, it is very important to classify the text into different categories, so that both systems and humans can easily access the information on trending topics. The millions of tweets posted on Twitter every day can be retrieved through the Twitter API.

Introduction of Project

In this dissertation, the literature review presents a deep study of the collection and processing of Twitter data and of various supervised machine learning algorithms and their efficiency. One of the major issues faced during text categorisation is the incorrect matching of words. To overcome this issue, the Bag-of-Words approach was used, but in a few instances this approach was not sufficient. To avoid this issue, word embedding was used, an approach suited to longer texts; a word embedding is a dense, low-dimensional, real-valued vector representation of a word. The next chapter presents a detailed literature review of machine learning algorithms and their classification accuracy. In this dissertation, six classification approaches have been applied, whose individual decisions can be integrated into an ensemble-based classification decision. The methods used are the Naïve Bayesian classifier, the Support Vector Machine (SVM) classifier, the Multinomial Naïve Bayes classifier, the Bernoulli Naïve Bayes classifier, the Decision Tree classifier and the Random Forest classifier. The experimental results of these classifiers are provided and analysed critically. This dissertation is organised in the following way. First, the project rationale is explained, followed by the aims and objectives. Secondly, a deep literature survey is summarised, covering the previous work done by other researchers. Section 3 presents the design and research methodologies applied to the collected data, the design architecture and the tools implemented to achieve the results; the six machine learning approaches are also discussed in the same section, which shows the ensemble method for topic classification. The last section presents the results and the conclusions obtained from the experiments.


Aims and Objectives

Social networking sites are used extensively worldwide; Facebook, Google Plus and Twitter are among the most common platforms for individuals. These sites are used to share views, thoughts and opinions, and the opinions of individuals can be expressed in the form of text. Text classification is a method which assigns raw text data to pre-defined categories, such as positive and negative opinions. This project will examine the analysis of such data and then implement an algorithm for the automatic classification of text as positive, negative or neutral. To achieve this outcome, text mining will be performed on the data to classify the text according to topic.

In this dissertation, the first aim is to retrieve and collect data from Twitter on various trending topics. This objective will be carried out by researching the most discussed topics on Twitter.

The next aim of this dissertation is to extract the data through the Twitter API (Application Programming Interface), which will be used to retrieve the basic data. This objective will be met by implementing an algorithm in a scripting language to retrieve data using the Twitter APIs.

The programming language which will be employed to achieve the results is Python.

The dissertation will also aim to classify the Twitter streams according to topic. The classification of text will be carried out using the Bag-of-Words and word embedding approaches.

Furthermore, the next aim is to classify the results based on polarity. To achieve this aim, the results will be categorised into positive, negative and neutral texts by performing text mining on the data.

Further to this research, the data will be analysed to compare the performance of the proposed algorithm in classifying the tweets. The data will be analysed with six machine learning approaches to obtain accurate results. Detailed experimental comparisons of the six topic classification methods, with analytical comments, are provided in the results section.

The results will also be represented graphically, to show the difference in the popularity of tweets and the outcomes of the machine learning approaches.

The project's main aim is to design a classifier which can detect new posts by users and classify them into text categories using automated machine learning algorithms.

2. Literature Review

Social networking sites are used extensively worldwide; Facebook, Google Plus and Twitter are among the most common platforms, and they are used to share views, thoughts and opinions. The opinions of individuals can be categorised in the form of text, and text classification is a method which assigns raw text data to pre-defined categories, such as positive and negative opinions. A great deal of research has been done on topic classification methods. Topic classification is a division of text mining, which includes information retrieval, lexical analysis and many other techniques; many methods widely applied in text mining are exploited in sentiment mining as well. The first step in Twitter data analysis is the collection of data, which has been done by many researchers through various techniques. Fuhry et al. used a 'Bag of Words' approach, in which tweets were assigned 'to a predefined set of generic classes' (Fuhry, et al., 2010, p. 4). They picked out tweet information from messages, views, thoughts on a topic, reviews, news and 'domain specific features'. Fuhry et al. also used another strategy of '8 features which consist of one nominal and seven binary features', which was used to pick out 'slangs', 'signs', 'phrases', 'opinion words' etc. (Fuhry, et al., 2010). Through these two techniques, Fuhry et al. gathered tweet information which set a foundation for retrieving new tweets online with 'better accuracy'; although 'noise' was found in the new tweets, 'noise removal techniques' were also kept in consideration (Fuhry, et al., 2010). Cambria et al. proposed that the simplest way to do topic classification is by means of the 'Lexicon-based method', which computes the 'sum of the number of the positive words as well as negative words which appear' in the text file to determine its classification (Cambria, et al., 2013). However, the weakness of this approach is poor recognition of sentiment when negation is involved.
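The lexicon-based method described above can be sketched in a few lines of Python. This is a minimal illustration, not a real sentiment lexicon: the two word lists are made up for the example.

```python
# A minimal sketch of the lexicon-based method: count the positive and
# negative words in a text and classify by the difference.  The two word
# lists are illustrative placeholders, not a real sentiment lexicon.
POSITIVE = {"good", "great", "happy", "love", "win"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "lose"}

def lexicon_classify(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Note that this sketch exhibits exactly the weakness mentioned above: a phrase like "not good" still counts one positive word, so negation is misclassified.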

Dellia et al. gathered data from Twitter and categorised the text using the 'Twitter API (Application Programming Interface)', defining the algorithm in Python (Dellia & Tjahyanto, 2017). Similarly, Gull et al. extracted data from Twitter and saved it in the form of tweets; they 'extracted the data from Twitter using Twitter API' and defined an algorithm in Python 'named tweeps' (Gull, et al., 2016). However, some researchers use a different approach for extracting data from Twitter. Lee et al. used 'data mining software in JAVA', obtaining results through machine learning; it supported 'algorithms' based on 'data preprocessing' and 'data classification' (Lee, et al., 2012). The majority of methodologies employed for text classification have been built on supervised machine learning algorithms. The Naïve Bayes approach has been a common technique in text classification due to its 'simplicity as well as efficiency' (Melville et al., 2009). The philosophy behind it is that the joint likelihood of two events can be used to forecast the probability of one event given the occurrence of the other. The main notion of the Naïve Bayesian approach is that the features used in categorisation are independent of one another, which considerably reduces the computational complexity of the categorisation algorithm. The Support Vector Machine (SVM) method has also been regarded as one of the best text classification approaches. Xia et al. (2010) employ the Support Vector Machine technique as a statistical classification method built on the maximisation of the margin between the instances and the separating hyperplane.
On the other hand, Han et al. (2016) suggested an approach dissimilar to the other machine learning approaches: the K-nearest neighbours (KNN) method does not extract any features from the training dataset and only relates the similarity of a document to that of its neighbours. On the subject of feature selection, Tan et al. (2008) compared four feature selection methods in addition to five machine learning approaches on Chinese texts, asserting that the Information Gain algorithm performs better than the other feature selection techniques (Tan et al., 2008). According to Jain et al., a data collection from the Yahoo website was classified into seven classes, including health, business, politics, entertainment, technology and sports, with every news item indexed manually by human experts. Using the 814 training documents, Jain et al. compared four algorithms (the nearest neighbour classifier (NN), Naïve Bayes classifier (NB), subspace classifier (SS) and decision tree classifier (DT)) on the two test data sets. The results of the experiment reveal that all four classification algorithms work relatively well, but the Naïve Bayes method works best on the test dataset.

On the other hand, Joachims applied a reduced vocabulary as the feature set by first stemming words, employing a stop list of frequent words, and eliminating words that appear only rarely from the feature set. Using approximately 13,000 documents from the Reuters-21578 document set and 20,000 medical abstracts from the Ohsumed corpus, Joachims compared the performance of several algorithms including Naïve Bayes and SVM; on both document sets, the tests showed that the SVM classifier performed better and gave the most accurate outcomes (Joachims, 1998). Using the whole vocabulary as the feature set, Rennie and Rifkin applied the SVM algorithm to two data sets: 19,997 news-related documents in 20 categories and 9,649 industry sector documents in 105 categories. The Multinomial Naïve Bayes algorithm assumes that the document features are independent, while both the SVM and Naïve Bayes algorithms are linear, scalable to huge document sets and efficient. Nonetheless, after comparing the two classifiers, it was found that the Support Vector Machine provided the most accurate outcome (Rennie & Rifkin, 2001).

3. Design and Research Methodology

The figure below describes the architecture of the system as well as the methodology used in the project.

System methodology and architecture

UML Diagram


3.1 Implemented Tools

The program was written in the Python programming language because of the wide availability of libraries. The Natural Language Toolkit (NLTK) is the main library used in this program; it is written in Python and is therefore easy to use. The other tools used are the Python Twitter Tools and Tweepy, since they offer an easy way to interact with the Twitter API from Python and are compatible with Python 3.

3.2 Twitter APIs

The Twitter API provides a way to programmatically access existing data on Twitter. For access to the tweets posted by users, the API provides two different interfaces, each with unique benefits and restrictions: the Streaming API and the REST (Representational State Transfer) API. The major restriction of each option is that the Streaming API only gives access to new data, while the REST API only allows a limited number of requests in a given period. Access to the Twitter API is made possible through Python's direct access to websites (Zhang et al., 2010). The Python Twitter Tools convert the replies from the Twitter API into native Python data structures, for instance dictionaries or lists, which allows programs written this way to be easily set up for different users, while the REST API uses HTTP resources to gather data. As a result, both Twitter APIs (REST and Streaming) can be accessed with the help of libraries such as Tweepy (installed via pip), which offer numerous easy-to-use functions together with well-structured documentation.

3.2.1 The REST API

The Twitter Representational State Transfer (REST) API provides access to read and write Twitter data in batches. It allows various options for accessing different parts of the data, such as the most recent tweets sent by a specific user or any given tweet, and, if one has access to the user's authentication details, the most recent tweets shown to that user. Additionally, the REST API provides an exclusive way to post new tweets and other updates to the Twitter service (Saif et al., 2011). However, the major disadvantage of the REST API is that it is rate limited, meaning there is a restriction on the number of requests that a program is allowed to make in a given period. This restriction can be challenging when attempting to gather huge quantities of data or when repeatedly fetching small amounts of data (Agarwal et al., 2012). On the other hand, the REST API allows the possibility of posting content to Twitter, something that is not possible with the Streaming API.
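As a hedged sketch of how the REST API is typically consumed from Python with Tweepy, the fragment below reads a user's recent tweets. The four credential strings are placeholders, the rate-limit figure is only an example (the actual cap varies by endpoint), and the import is guarded so the sketch stays loadable without the library installed.

```python
# Hedged sketch: reading recent tweets through the REST API with Tweepy.
# The credential strings below are placeholders, not real keys.
try:
    import tweepy  # third-party: pip install tweepy
except ImportError:
    tweepy = None  # keep the sketch importable without the library

CONSUMER_KEY = "YOUR-CONSUMER-KEY"
CONSUMER_SECRET = "YOUR-CONSUMER-SECRET"
ACCESS_TOKEN = "YOUR-ACCESS-TOKEN"
ACCESS_SECRET = "YOUR-ACCESS-SECRET"

def within_rate_limit(requests_made, window_limit=900):
    """The REST API caps requests per 15-minute window (the cap varies by
    endpoint); check the local request count before calling again."""
    return requests_made < window_limit

def fetch_recent_tweets(api, screen_name, count=10):
    """Return the text of a user's most recent tweets via the REST API."""
    return [status.text
            for status in api.user_timeline(screen_name=screen_name, count=count)]

if tweepy is not None:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)
    # tweets = fetch_recent_tweets(api, "some_user")  # needs valid credentials
```

The commented-out call at the bottom is left inert because it requires valid credentials and a network connection.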

3.2.2 The Streaming API

The Streaming API provides fewer but more specific operations compared to the REST API; however, it can be used to collect massive quantities of data. The Streaming API continuously delivers new data from some subset of the data posted on Twitter. It can be used to collect new tweets posted for specific users to view, or a randomly picked subset of all the tweets posted on Twitter. Moreover, it is possible to receive tweets which match certain filters, such as tweets containing specific words or posted by a particular group of users (Agarwal et al., 2012).
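The keyword-filtering idea behind the Streaming API can be sketched as follows. The two pure helpers mirror locally what the `track` filter does server-side; the listener class names are from Tweepy 3.x and are guarded, since they are an assumption about the installed library version.

```python
# A hedged sketch of the streaming API's keyword filter.  The helpers are
# pure Python; the Tweepy listener at the bottom assumes the Tweepy 3.x
# class names and is skipped if they are unavailable.
try:
    import tweepy
except ImportError:
    tweepy = None

def build_track_filter(keywords):
    """Normalise tracked keywords: lower-case, strip blanks, drop duplicates."""
    return sorted({k.strip().lower() for k in keywords if k.strip()})

def matches_filter(tweet_text, track):
    """True if the tweet contains any tracked keyword (case-insensitive)."""
    text = tweet_text.lower()
    return any(keyword in text for keyword in track)

if tweepy is not None and hasattr(tweepy, "StreamListener"):
    class PrintListener(tweepy.StreamListener):  # Tweepy 3.x listener class
        def on_status(self, status):
            print(status.text)                   # handle each delivered tweet
    # stream = tweepy.Stream(auth, PrintListener())   # auth as in the REST sketch
    # stream.filter(track=build_track_filter(["Brexit"]))
```

The `stream.filter(track=...)` call is left commented out because it opens a long-lived network connection and requires valid credentials.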

3.3 Coding in Python Language

To be in a position to import streams from Twitter, one has to create a Twitter application, which requires a valid Twitter account. Having created a new application on Twitter, it is possible to obtain the Consumer Keys and Access Tokens. The image below shows an illustration of the keys and access tokens.

Setting up the Twitter API: accessing the tokens

With these in place, the algorithm can stream data from Twitter and categorise the text. The algorithm illustrated below imports information from Twitter regarding Brexit and prints all the streams that are related to Brexit.

Coding in Python

When the code is run, it gives us all the information tweeted on Twitter that matches the word 'Brexit'. The collected data can then be saved so that the positivity and negativity contained in the text can be determined for categorisation. To save this data, a .csv file is used, which lets us create a database of text from which to trim the needed information. In this case, a file has already been created and named 123.csv, as illustrated in the image below.
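The CSV-saving step described above can be sketched with the standard library alone. The field names follow the attributes mentioned in the text (creation time, tweet ID, tweet text); the helper names are my own, not from the original code.

```python
# A minimal sketch of saving collected tweets into a CSV database, as
# described in the text.  The field names follow the tweet attributes the
# text mentions; the function names are illustrative.
import csv

FIELDS = ["created_at", "tweet_id", "text"]

def save_tweets(path, tweets):
    """Write tweet dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(tweets)

def load_tweets(path):
    """Read the saved tweets back for later classification."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

A stream listener's `on_status` handler would append rows of this shape, and the classifier later reads the file back with `load_tweets`.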

CSV file created

The code is useful for navigating the text with its different pieces of information, such as when the tweet was created, the tweet ID and the text of the tweet, among others. This is significant since it enables us to determine the number of students that are in favour of Brexit.

3.4 Datasets

When executed, the program automatically downloads and processes new tweets from a limited group of Twitter users. The collection of this data depends on the Python Twitter Tools for fetching data from Twitter. Tweets that are not relevant are removed from the dataset with the help of the metadata attributes made available by Twitter, which are used to determine the significance of every tweet; the collected tweets consist of original tweets and retweets. Irrelevant tweets are discarded, and the relevant tweets are manually labelled in the dataset as positive, negative or neutral. For this project, the dataset has a total of 50,000 tweets, of which 28,000 are labelled positive, representing the number of students in favour of Brexit; 14,000 are labelled negative, meaning students not in favour of Brexit; 6,000 are labelled neutral, representing those students that were neither in favour nor against; and lastly, 2,000 tweets were discarded as irrelevant to Brexit. The techniques used in categorising the dataset were word embedding and Bag-of-Words. The table below illustrates the topic distribution.

Label        Tweets    Share
Positive     28,000    56%
Negative     14,000    28%
Neutral       6,000    12%
Discarded     2,000     4%
Total        50,000   100%

The table above shows the percentage distribution of the dataset.

3.5 Machine Learning Approach

A supervised machine learning approach is a technique for inferring a function from labelled training examples. The training sample for supervised learning comprises a large set of examples for a specific subject matter, each coming as a pair of an input (feature vector) and an output value (the anticipated outcome). These algorithms analyse the data and produce a function that is then used for mapping new data to the corresponding categories. In this project, six machine learning classifiers have been employed, as follows.

3.5.1 Naïve-Bayes Classifier

Naïve Bayes classifiers are probabilistic classifiers from the family of machine learning techniques. These classifiers rely on Bayes' theorem with a strong (naive) assumption of independence among each pair of features. The Naïve Bayesian approach is the most commonly used technique for categorising text data. The algorithm assumes that dataset features are independent of one another, regardless of their occurrences in other features, which indicates their importance to specific data attributes. The Naïve Bayesian classifier treats every tweet text as a set of words. It takes one tweet and computes the product of the probabilities of each feature occurring in the tweet for each of the three sentiment orientations: positive, negative and neutral. The tweet is then assigned to the orientation that obtains the largest product of probabilities. In this dissertation, the Naïve Bayes algorithm provided in NLTK is used to execute the tests and experiments.
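The "largest product of probabilities" computation described above can be sketched in plain Python (rather than NLTK, which the dissertation actually uses), with Laplace smoothing so that unseen words do not zero out the product. The tiny training set is made up for illustration, and the product is computed as a sum of logs for numerical stability.

```python
# A compact plain-Python sketch of the Naive Bayes decision described in
# the text: per-class word likelihoods with Laplace smoothing, and the
# class with the largest (log-)product of probabilities wins.
import math
from collections import Counter, defaultdict

TRAIN = [  # made-up labelled tweets
    ("great day love it", "positive"),
    ("happy with the result", "positive"),
    ("terrible news sad day", "negative"),
    ("hate this bad result", "negative"),
    ("the vote is today", "neutral"),
]

def train_nb(examples):
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(text, class_counts, word_counts, vocab):
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)                # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            # Laplace smoothing: unseen words get a small non-zero probability
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(TRAIN)
```

For example, `classify_nb("love the result", *model)` assigns the positive orientation because the product of its word likelihoods is largest there.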

3.5.2 Multinomial Naïve Bayes

Multinomial NB extends the NB algorithm: it executes NB for multinomially distributed data and uses one of its variants for classifying text (whereby word counts are used to represent the data, which works extremely efficiently in practice). The distribution is parameterised, for each class y, by vectors θy = (θy1, …, θyn), where n is the number of features (the vocabulary size for text classification) and θyi is the probability P(xi | y) of feature i appearing in a sample of class y.
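The parameter θyi above is normally estimated from smoothed word counts. The following worked example assumes the standard smoothed estimate θyi = (Nyi + α) / (Ny + αn), where Nyi is the count of feature i in class y, Ny the total feature count in y, n the vocabulary size and α the smoothing constant; the counts themselves are made up.

```python
# Worked example of the multinomial parameterisation: the smoothed
# estimate of theta_yi = P(x_i | y) from training counts (counts made up).
def theta(count_iy, total_y, vocab_size, alpha=1.0):
    """Smoothed estimate (N_yi + alpha) / (N_y + alpha * n)."""
    return (count_iy + alpha) / (total_y + alpha * vocab_size)

# Class y contains 100 feature occurrences over a vocabulary of 10 words,
# and the word of interest appears 20 times in class y.
estimate = theta(20, 100, 10)   # (20 + 1) / (100 + 10) = 21/110
unseen = theta(0, 100, 10)      # an unseen word still gets 1/110, not zero
```

Keeping `unseen` strictly positive is the point of the smoothing: a word absent from a class's training data would otherwise force the whole product P(x | y) to zero.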

3.5.3 Bernoulli Naïve Bayes Classifier

Bernoulli NB likewise executes the NB set of rules for training and categorising. The Bernoulli NB classifier utilises the multivariate Bernoulli distribution: there can be several features, n, but every one is assumed to be a binary (Boolean, true or false) variable. Therefore, the examples of each class have to be represented with binary-valued variables. If any other data type is provided, Bernoulli NB may binarise the input.

The Bernoulli NB decision rule is described as:

P(xi | y) = P(i | y) xi + (1 − P(i | y)) (1 − xi)
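The rule can be transcribed directly into code. Note what it implies: an absent feature (xi = 0) contributes the factor 1 − P(i | y), so Bernoulli NB explicitly penalises the non-occurrence of a word, unlike the multinomial variant. The probabilities in the example are made up.

```python
# Direct transcription of the Bernoulli NB decision rule:
# P(x_i | y) = P(i | y) * x_i + (1 - P(i | y)) * (1 - x_i).
import math

def bernoulli_log_likelihood(x, p_given_y):
    """Log-likelihood of binary feature vector x under the per-feature
    occurrence probabilities p_given_y of one class."""
    score = 0.0
    for xi, pi in zip(x, p_given_y):
        score += math.log(pi * xi + (1.0 - pi) * (1.0 - xi))
    return score

# Three binary features: word 1 present, words 2 and 3 absent.
x = [1, 0, 0]
p_pos = [0.8, 0.5, 0.1]   # made-up P(i | y = positive) for each feature
score = bernoulli_log_likelihood(x, p_pos)  # log(0.8) + log(0.5) + log(0.9)
```

A full classifier would compute this score per class (plus the class prior) and pick the largest.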

3.5.4 Support Vector Machine

Support vector machine classifiers are supervised machine learning models used for binary classification as well as regression analysis. However, for this dissertation, the purpose is to develop classifiers capable of categorising tweets into three classes. According to Hsu and Lin, the one-against-one (pairwise) classification approach is better than the one-against-all approach for multiclass support vector machine classification. In the pairwise approach, one SVM classifier is trained for every pair of classes to separate them; we adopted this pairwise method for the SVM classification. The LIBSVM library provides the algorithm that can be used as a pairwise classifier for multiclass SVM organisation, and it is applied through NLTK to train the SVM and execute the trials and experiments.
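The one-against-one voting scheme described above can be sketched independently of any SVM library. The pairwise decision function here is a deliberately trivial stand-in for a trained SVM (it just compares character overlap), so only the voting mechanics should be taken from this sketch.

```python
# Sketch of one-against-one (pairwise) multiclass voting: one binary
# classifier per pair of classes, each casts a vote, majority wins.
# toy_decide is a stand-in for a trained SVM decision function.
from collections import Counter
from itertools import combinations

CLASSES = ["positive", "negative", "neutral"]

def pairwise_vote(x, decide):
    """decide(a, b, x) returns the winner among classes {a, b} for sample x;
    votes over all class pairs are tallied and the majority wins."""
    votes = Counter()
    for a, b in combinations(CLASSES, 2):
        votes[decide(a, b, x)] += 1
    return votes.most_common(1)[0][0]

def toy_decide(a, b, x):
    """Toy pairwise rule: pick the class name sharing more characters with
    the sample string (NOT a real SVM, just a placeholder boundary)."""
    overlap = lambda c: len(set(c) & set(x))
    return a if overlap(a) >= overlap(b) else b

label = pairwise_vote("positive words here", toy_decide)
```

With three classes this trains three pairwise classifiers; in general, k classes require k(k−1)/2 of them, which is the cost of the one-against-one scheme.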

3.5.5 Decision Tree Classifier

A decision tree is a flowchart-like tree structure in which every internal node represents a test on an attribute, every branch represents an outcome of the test, and every leaf node represents a class. Iterative Dichotomiser 3 (ID3), developed by J. Ross Quinlan, was the first well-known decision tree algorithm. Since ID3 keeps iterating the process of splitting the data into subclasses, it is prone to the over-fitting problem; moreover, the ID3 algorithm is unable to deal with continuous features or features with missing values. C4.5 was therefore developed by Ross Quinlan to resolve these problems. C4.5 discretises a continuous feature by setting a threshold and splitting the data into one class whose attribute value exceeds the threshold and another whose attribute value is below or equal to it. C4.5 deals with missing-value features by not using the missing values in the gain calculations, and it addresses over-fitting with a post-pruning technique referred to as pessimistic pruning. Pessimistic pruning employs a cost-complexity pruning algorithm and uses the training counts to estimate the error rate; the error rate of the tree, the proportion of mis-categorised instances, is computed at each tree node and compared with that of its subtree. For this dissertation, the C4.5 algorithm was adopted via NLTK in Python.
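The threshold-based split that C4.5 performs on a continuous feature is chosen by information gain, which can be sketched as follows. The sample values and labels are made up; a real implementation would try every candidate threshold and keep the best.

```python
# Sketch of the information-gain computation behind ID3/C4.5 splits,
# including C4.5's threshold discretisation of a continuous feature.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain_threshold(values, labels, threshold):
    """Gain from splitting a continuous feature at threshold: one branch
    holds samples <= threshold, the other samples > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    remainder = (len(left) / len(labels)) * entropy(left) \
              + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - remainder

values = [1.0, 2.0, 3.0, 4.0]          # a continuous feature
labels = ["neg", "neg", "pos", "pos"]  # made-up class labels
gain = info_gain_threshold(values, labels, 2.5)  # perfect split: gain = 1 bit
```

Here the threshold 2.5 separates the classes perfectly, so the gain equals the full entropy of the label set (1 bit); a useless threshold would yield a gain near zero.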

3.5.6 Random Forest Classifier

A single decision tree generated by the ID3 or C4.5 algorithm is not necessarily the best classifier. Random Forest was therefore developed as an ensemble method based on multiple decision trees. Random Forest makes use of the bagging technique in developing classification models: in every decision tree, each node is given a random subset of the attributes designated as the candidate attributes for splitting that node. By growing several decision trees, a Random Forest classifier is built. In the classification process, every decision tree in the Random Forest classifies an instance, and the Random Forest classifier assigns it to the class with the majority of votes from the individual decision trees. In this dissertation, the Random Forest implementation in NLTK was used to obtain results.
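The bagging-and-voting procedure above can be sketched in plain Python. A trivial majority-label stub stands in for a real decision tree, and the per-node attribute subsampling is omitted for brevity; only the bootstrap sampling and majority vote are illustrated.

```python
# Sketch of Random Forest mechanics: bootstrap samples (bagging), one
# "tree" per sample (a trivial majority-label stub stands in for a real
# decision tree), and a majority vote over the trees at prediction time.
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) items with replacement (the bagging step)."""
    return [rng.choice(data) for _ in data]

def train_stub_tree(sample):
    """Stand-in for a decision tree: always predicts the sample's majority label."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def random_forest(data, n_trees=5, seed=0):
    rng = random.Random(seed)
    return [train_stub_tree(bootstrap(data, rng)) for _ in range(n_trees)]

def forest_predict(trees, x):
    """Every tree votes; the class with the majority of votes is assigned."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

data = [("tweet a", "pos")] * 6 + [("tweet b", "neg")] * 2  # made-up data
trees = random_forest(data)
prediction = forest_predict(trees, "any tweet")
```

Because each tree sees a different bootstrap sample (and, in a full implementation, a random attribute subset at each node), the trees disagree in useful ways, and the majority vote reduces the variance of any single tree.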

4. Results and Accuracy of the Machine Learning Classifiers

The experiments were undertaken using the six classification models: Bernoulli Naïve Bayes, the Naïve Bayesian classifier, the Random Forest classifier, Multinomial Naïve Bayes, the C4.5 decision tree classifier and the SVM classifier. The test outcomes for the classification experiments are shown in the table below. The Multinomial NB classifier had the lowest accuracy of 70.5%, while the Naïve Bayesian classifier had the highest accuracy of 85.1%. Of the other classifiers, the SVM classifier had an accuracy of 78.7%, the Random Forest classifier 83.4% and the C4.5 decision tree 83.9%. From the comparison of the six test classifiers, the Naïve Bayesian classifier produced the best outcome, revealing that it is the most accurate classifier.

Graphical representation of the Naïve Bayes classification model

The figure shows the classification accuracy as a function of the training set size when the Naïve Bayes method is used to train the classifier. The accuracy on the manually tagged dataset is the topic classification accuracy, whereas the accuracy shown for the external dataset is the accuracy when classifying new tweets.


5. Conclusion

This dissertation provides a pragmatic contribution to this field of study by comparing the performance of divergent popular text classification methodologies, in order to further improve topic and sentiment classification performance. In the context of analysing Twitter texts, very little work has been undertaken. Nonetheless, the attained results reveal an efficient performance by the Naïve Bayesian classifier. The past literature compares various traditional classification approaches and then picks the most precise individual technique to perform the sentiment classification; here, a collective technique is presented by combining these sentiment classifiers. Given that tweets are not as grammatically well-structured as formal document texts, text-based classification using the Random Forest classifier offers a fair outcome, so it can be leveraged in circumstances where Naïve Bayes classification cannot be performed.

Bibliography

A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R.Passonneau, “Sentiment Analysis of Twitter Data,” Annual International Conferences. New York: Columbia University, 2012.

A.M. Kaplan, and M, Haenlein, “Users of the world, unite! The challenges and opportunities of Social Media,” France: Paris, 2010.

El-Din, D. M., 2016. Enhancement Bag-of-Words Model for Solving the Challenges of Sentiment Analysis. (IJACSA) International Journal of Advanced Computer Science and Applications, 7(1), p. 9.

Fuhry, D. et al., 2010. "Short text classification in Twitter to improve information filtering." Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM.

Dellia, P. & Tjahyanto, A., 2017. Tax Complaints Classification on Twitter Using Text Mining. IPTEK, Journal of Science (ISSN: 2337-8530), 2(1), pp. 11-14.

Gull, R. et al., 2016. Pre Processing of Twitter’s Data for Opinion Mining in Political Context. 7 09, pp. 1560-1570.

H. Saif, Y. He, and H. Alani, “Semantic Sentiment Analysis of Twitter,” Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data. United Kingdom: Knowledge Media Institute, 2011.

Lee, K. et al., 2012. Twitter Trending Topic Classification: 2011 11th IEEE International Conference on Data Mining Workshops, Canada: IEEE.

M. N. Hila Becker and L. Gravano, “Beyond trending topics: Real-world event identification on Twitter,” in Proceedings of AAAI, 2011.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. "Recognizing contextual polarity in phrase-level sentiment analysis." Proceedings of the conference on human language technology and empirical methods in natural language processing. Association for Computational Linguistics, 2005.

Y. S. Yegin Genc and J. V. Nickerson, “Discovering context: Classifying tweets through a semantic transform based on Wikipedia,” in Proceedings of HCI International, 2011.
