New Dataset – @realDonaldTrump Tweet corpus

By: Richard W. Sharp

Release the tweets! We’ve made the Trump Watch’s database of tweets available on our new downloads page. It includes all tweets from @realDonaldTrump going back to November 10, 2016, collected with Twitter’s public API using the query from:realDonaldTrump. Both the raw tweets and the labels we have added are available. Each raw tweet file has a name in the format tweets_by_realDonaldTrump_yyyymmdd.json, where the date is the day the tweets were collected, not the day they were written. Each tweet records when it was created in its “created_at” field. A complete description of the information contained in a tweet is maintained by Twitter.
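To keep the two dates straight, here is a minimal Python sketch that reads the collection date from a file name and the creation time from a tweet. The function names are ours, made up for illustration. The “created_at” format string assumes the timestamp style used by Twitter’s public REST API (for example, “Sat Dec 17 12:30:00 +0000 2016”), so check it against the files you download.

```python
import re
from datetime import date, datetime

# File name pattern described in the post: tweets_by_realDonaldTrump_yyyymmdd.json
FILENAME_RE = re.compile(r"^tweets_by_realDonaldTrump_(\d{8})\.json$")

# Assumed layout of the "created_at" field, e.g. "Sat Dec 17 12:30:00 +0000 2016"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def collection_date(filename: str) -> date:
    """Return the date on which a raw file was collected (hypothetical helper)."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def created_at(tweet: dict) -> datetime:
    """Return the time a tweet was written, as a timezone-aware datetime."""
    return datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
```

Comparing the two values for a given tweet tells you how old the tweet was when that snapshot was taken.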

Since a search for tweets with the public API returns results from roughly the past week, we end up collecting the same tweet for several consecutive days. Each of the raw files contains at most one copy of each tweet (if we collected tweets more than once in a day, it’s the most recent version). However, the same tweet will typically appear in several of the files. Why keep duplicates? Because they’re not duplicates. Some features of a tweet change over time, such as the retweet count, which can give us insight into some short-lived trends. Sadly, we did not capture the recent “unpresidented” tweet, because it appeared and was corrected (in 27 minutes) faster than our collection updates, but it provides a good example of why it’s useful to archive the statements of public figures.
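As a rough illustration of why the repeated copies are useful, the sketch below groups snapshots of the same tweet and lines up its retweet count by collection date. It assumes each tweet carries the “id_str” and “retweet_count” fields from Twitter’s tweet object, and that you have already paired each tweet with the yyyymmdd date of the file it came from. Loading the files is left out because the post does not describe their internal layout, and the function name is hypothetical.

```python
from collections import defaultdict


def retweet_history(snapshots):
    """Build a per-tweet series of (collection date, retweet count).

    `snapshots` is an iterable of (collected_on, tweet) pairs, where
    `collected_on` is the yyyymmdd string from the file name and `tweet`
    is the decoded tweet dictionary.
    """
    history = defaultdict(list)
    for collected_on, tweet in snapshots:
        history[tweet["id_str"]].append((collected_on, tweet["retweet_count"]))
    # yyyymmdd strings sort chronologically, so a plain sort orders each series.
    for series in history.values():
        series.sort()
    return dict(history)


if __name__ == "__main__":
    # Made-up counts, purely to show the shape of the result.
    demo = [
        ("20161219", {"id_str": "1", "retweet_count": 30}),
        ("20161218", {"id_str": "1", "retweet_count": 10}),
    ]
    print(retweet_history(demo))
```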

For the Trump Watch, we classify each tweet for sentiment and whether or not it’s an insult. The file trump_dump.csv contains the unique id and text of each tweet, as well as the tags we use for classification and any notes. Please note that although this is a .csv file, it uses the | character rather than a comma as the delimiter between fields. This simplifies parsing, since commas are so common in the tweet text field.
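If you are reading the file from Python, the standard csv module handles the | delimiter directly. This is only a sketch: the helper name is ours, and because the post does not list the exact columns or say whether there is a header row, the rows are returned as plain lists for you to inspect. It also assumes the default quoting rules suit the file.

```python
import csv


def read_pipe_delimited(lines):
    """Parse |-delimited lines into a list of rows, each a list of strings."""
    return list(csv.reader(lines, delimiter="|"))


if __name__ == "__main__":
    # newline="" is what the csv module recommends when opening files.
    with open("trump_dump.csv", newline="", encoding="utf-8") as handle:
        rows = read_pipe_delimited(handle)
    print(len(rows), "rows read")
```

Because the delimiter is |, commas inside the tweet text stay within a single field.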

Here is how we categorized tweetID 810121703288410112:

Tag: State
Definition: All references to a country or similar entity (e.g., the United Nations, ISIS), as represented by the official apparatus of government (e.g., until 20 Jan 2017, “USA” implies the Obama administration). Uses ISO-standard 3-letter country codes.

Tag: State Sentiment
Definition: The sentiment (in the eye of the tweeter) implied by each state reference. This can be positive, negative, or neutral.
Example: #SsnCHNNeg

Tag: State Insult
Definition: Whether the reference to each state is an insult or a compliment (in the eye of the target state).
Example: #SinCHNIns
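Tags such as #SsnCHNNeg and #SinCHNIns can be pulled apart with a regular expression. The pattern below is inferred from those two examples only: a prefix (Ssn for state sentiment, Sin for state insult), a 3-letter country code, and a value. Treat it as a guess at the scheme, not a specification, and expect to adjust it once you have looked at the full tag set in the data.

```python
import re

# Inferred from the two example tags: prefix, 3-letter country code, value.
TAG_RE = re.compile(r"#(Ssn|Sin)([A-Z]{3})([A-Za-z]+)")

# Our reading of the prefixes, based on the tag definitions in the post.
KINDS = {"Ssn": "state sentiment", "Sin": "state insult"}


def parse_state_tags(text):
    """Return (kind, country code, value) for each state tag found in `text`."""
    return [
        (KINDS[prefix], country, value)
        for prefix, country, value in TAG_RE.findall(text)
    ]


if __name__ == "__main__":
    print(parse_state_tags("#SsnCHNNeg #SinCHNIns"))
```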

We will continue to regularly update and add to the collection. 

About The Author

Richard is a Seattle area data scientist who builds predictive models and the services that deliver them. He earned a PhD in Applied and Computational Math from Princeton University, and left academia for the dark side of science (industry) in 2010, following his wife to the land of flannel. Fan of coffee, beer, backpacking and puns. Enjoys a day on the lake fishing, and, better, cooking up the catch for a crowd.
