what

TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.5 billion tweets, spanning almost 5 years (January 2013 - November 2017). Metadata about the tweets, as well as extracted entities, sentiments, hashtags and user mentions, are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we encrypt the usernames and do not provide the text of the tweets. However, the actual tweet content and further information can be fetched through the tweet IDs.

More information is available in the following paper:

P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze,
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets,
15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.
Nominated for the "Best Resource Paper" award!
why

  • To relieve data consumers of the computationally intensive process of extracting and processing tweets.
  • To facilitate a variety of multi-aspect data consumption, exploration and analytics scenarios. These include:
    • time-aware and entity-centric exploration of the Twitter archive
    • data integration by directly exploiting existing knowledge bases (like DBpedia)
    • entity-centric analytics and knowledge discovery by inferring multi-aspect information related to one or more entities during certain time periods (like popularity, attitude or relations with other entities)

dataset

TweetsKB is available as Notation3 (N3) files (split by month) through the Zenodo data repository (under a Creative Commons Attribution 4.0 license):

• Sample file: N3 sample file (50 KB)

• SPARQL endpoint containing a sample of the dataset (currently about 5% of each year-month): SPARQL endpoint

statistics

• Main statistics:

Number of tweets: 1,560,096,518 (num of tweets per month)
Number of distinct users: 125,104,569
Number of distinct hashtags: 40,815,854
Number of distinct user mentions: 81,238,852
Number of distinct entities: 1,428,236
Number of tweets with sentiment: 772,044,599
Number of RDF triples: 48,207,277,042
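
A few derived ratios help put these totals in perspective. The short sketch below only does arithmetic on the figures listed above; the variable names are illustrative and not part of the dataset.

```python
# Totals copied from the main statistics listed above.
NUM_TWEETS = 1_560_096_518
NUM_USERS = 125_104_569
NUM_TWEETS_WITH_SENTIMENT = 772_044_599
NUM_TRIPLES = 48_207_277_042

# Average number of RDF triples used to describe one tweet.
triples_per_tweet = NUM_TRIPLES / NUM_TWEETS

# Percentage of tweets that carry a sentiment annotation.
sentiment_share_pct = 100 * NUM_TWEETS_WITH_SENTIMENT / NUM_TWEETS

# Average number of tweets per distinct (encrypted) user.
tweets_per_user = NUM_TWEETS / NUM_USERS

print(f"triples per tweet:     {triples_per_tweet:.1f}")
print(f"tweets with sentiment: {sentiment_share_pct:.1f}%")
print(f"tweets per user:       {tweets_per_user:.1f}")
```

On these figures, each tweet is described by roughly 31 triples on average, and about half of the tweets have a sentiment annotation.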

 

• Distribution of top-100,000 entity occurrences:

We notice that: i) there are around 15,000 entities with more than 10,000 occurrences; ii) there is a long tail of entities with fewer than 1,000 occurrences in the entire corpus.
 
• Distribution of top-100,000 hashtag occurrences:

We notice that: i) there are around 5,000 hashtags with more than 10,000 occurrences; ii) there is a long tail of hashtags with fewer than 1,000 occurrences in the entire corpus.
 
• Overview of entity types of the top-100,000 entities:

DBpedia type Number of distinct entities
http://dbpedia.org/ontology/Person 21,139 (21.1%)
http://dbpedia.org/ontology/Organisation 14,815 (14.8%)
http://dbpedia.org/ontology/Location 8,215 (8.2%)
http://dbpedia.org/ontology/Company 7,444 (7.4%)
http://dbpedia.org/ontology/Athlete 5,192 (5.2%)
http://dbpedia.org/ontology/Artist 3,737 (3.7%)
http://dbpedia.org/ontology/City 2,563 (2.6%)
http://dbpedia.org/ontology/SoccerPlayer 1,482 (1.5%)
http://dbpedia.org/ontology/Disease 1,425 (1.4%)
http://dbpedia.org/ontology/AmericanFootballPlayer 980 (1.0%)
http://dbpedia.org/ontology/SoccerClub 667 (0.7%)
http://dbpedia.org/ontology/BasketballPlayer 655 (0.7%)
http://dbpedia.org/ontology/BaseballPlayer 640 (0.6%)
http://dbpedia.org/ontology/Animal 635 (0.6%)
http://dbpedia.org/ontology/Country 544 (0.5%)
http://dbpedia.org/ontology/Event 510 (0.5%)
http://dbpedia.org/ontology/Actor 246 (0.2%)
http://dbpedia.org/ontology/Politician 208 (0.2%)

For around 14% of the entities, there is no DBpedia type.

data model

RDF/S Model:

 
Instantiation example:

example queries

•  The following query requests the number of tweets per month mentioning Alexis Tsipras (Greek prime minister) in 2015.
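
A query of this kind could be sketched as follows. This is a minimal illustration, not the authors' exact query: the prefixes and terms used (sioc:Post, dc:created, schema:mentions, nee:hasMatchedURI) are assumptions about the TweetsKB vocabulary that should be checked against the RDF/S model, and the endpoint URL is deliberately left as a caller-supplied parameter.

```python
from urllib.parse import urlencode

# ASSUMPTION: the prefixes, class and property names below are this sketch's
# guess at the TweetsKB vocabulary; verify them against the data model.
QUERY = """\
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>
PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?month (COUNT(DISTINCT ?tweet) AS ?numOfTweets) WHERE {
  ?tweet a sioc:Post ;
         dc:created ?date ;
         schema:mentions ?mention .
  ?mention nee:hasMatchedURI dbr:Alexis_Tsipras .
  FILTER (YEAR(?date) = 2015)
  BIND (MONTH(?date) AS ?month)
}
GROUP BY ?month
ORDER BY ?month
"""


def request_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL for `query`.

    `endpoint` is whatever SPARQL endpoint address the caller wants to use;
    no particular address is assumed here.
    """
    return endpoint + "?" + urlencode({"query": query})
```

The resulting URL can be fetched with any HTTP client; the endpoint decides which result formats it returns.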
 

The result of this query shows that the number of tweets increased significantly in June and July, most likely because of the Greek bailout referendum held in July 2015, which followed the bank holiday and capital controls of June 2015.


•  The following query retrieves the top-5 entities co-occurring with Barack Obama in tweets of summer 2016.
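
One way to express such a co-occurrence query is sketched below, as a small builder that fills in the entity, the time window and the result limit. The vocabulary terms are assumptions about the TweetsKB model, the function name is hypothetical, and reading "summer 2016" as 1 June to 31 August is this sketch's interpretation.

```python
from string import Template

# ASSUMPTION: property names and the xsd:dateTime typing of dc:created are
# guesses at the TweetsKB vocabulary; verify them against the data model.
_COOCCURRENCE_TEMPLATE = Template("""\
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>
PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?other (COUNT(DISTINCT ?tweet) AS ?numOfTweets) WHERE {
  ?tweet dc:created ?date ;
         schema:mentions ?m1, ?m2 .
  ?m1 nee:hasMatchedURI <$entity> .
  ?m2 nee:hasMatchedURI ?other .
  FILTER (?other != <$entity>)
  FILTER (?date >= "$start"^^xsd:dateTime && ?date < "$end"^^xsd:dateTime)
}
GROUP BY ?other
ORDER BY DESC(?numOfTweets)
LIMIT $limit
""")


def cooccurring_entities_query(entity_uri, start, end, limit=5):
    """Return a SPARQL query for the entities most often mentioned with `entity_uri`."""
    return _COOCCURRENCE_TEMPLATE.substitute(
        entity=entity_uri, start=start, end=end, limit=int(limit)
    )


query = cooccurring_entities_query(
    "http://dbpedia.org/resource/Barack_Obama",
    start="2016-06-01T00:00:00",  # assumed start of "summer 2016"
    end="2016-09-01T00:00:00",    # exclusive upper bound
)
print(query)
```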
 


•  The following query requests the top-10 hashtags co-occurring with the entity Refugee (http://dbpedia.org/resource/Refugee) in 2016.
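
A possible shape for this query, together with a helper that reads the answer, is sketched below. The hashtag modelling (sioc_t:Tag with rdfs:label) and the other vocabulary terms are assumptions to be checked against the data model; the helper relies only on the standard SPARQL 1.1 JSON results layout, and its name is hypothetical.

```python
import json

# ASSUMPTION: class and property names are guesses at the TweetsKB
# vocabulary; verify them against the data model.
QUERY = """\
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>
PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>
PREFIX sioc_t: <http://rdfs.org/sioc/types#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?hashtagLabel (COUNT(DISTINCT ?tweet) AS ?numOfTweets) WHERE {
  ?tweet dc:created ?date ;
         schema:mentions ?entity, ?hashtag .
  ?entity nee:hasMatchedURI <http://dbpedia.org/resource/Refugee> .
  ?hashtag a sioc_t:Tag ;
           rdfs:label ?hashtagLabel .
  FILTER (YEAR(?date) = 2016)
}
GROUP BY ?hashtagLabel
ORDER BY DESC(?numOfTweets)
LIMIT 10
"""


def rows_from_sparql_json(payload: str):
    """Convert a SPARQL 1.1 JSON results document into (hashtag, count) tuples.

    Expects the two variables selected by QUERY above.
    """
    bindings = json.loads(payload)["results"]["bindings"]
    return [
        (b["hashtagLabel"]["value"], int(b["numOfTweets"]["value"]))
        for b in bindings
    ]
```

Because the query already orders by count, the returned list keeps the ranking produced by the endpoint.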
 


•  The following query requests popular tweets of 2016 (with more than 100 retweets) mentioning German politicians with strong negative sentiment (> 0.75). The query exploits the extracted entities and uses query federation to access DBpedia, retrieving a list of German politicians as well as their birthplaces.
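
The federated pattern described here can be sketched with a SPARQL 1.1 SERVICE clause that asks the public DBpedia endpoint for the politicians, joined with local patterns for sentiment and retweet counts. Everything about the local vocabulary (the onyx and wna terms, the use of schema:ShareAction for retweets) and the DBpedia category dbc:German_politicians is an assumption of this sketch, and the builder function name is hypothetical.

```python
from string import Template

# ASSUMPTION: all TweetsKB-side terms and the DBpedia category are guesses
# made for illustration; verify them before running the query.
_FEDERATED_TEMPLATE = Template("""\
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>
PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>
PREFIX onyx: <http://www.gsi.dit.upm.es/ontologies/onyx/ns#>
PREFIX wna: <http://www.gsi.dit.upm.es/ontologies/wnaffect/ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>

SELECT DISTINCT ?tweet ?politician ?birthPlace WHERE {
  # Remote part: evaluated by DBpedia through query federation.
  SERVICE <http://dbpedia.org/sparql> {
    ?politician dc:subject dbc:German_politicians ;
                dbo:birthPlace ?birthPlace .
  }
  # Local part: evaluated against the tweet corpus.
  ?tweet dc:created ?date ;
         schema:mentions ?entity ;
         onyx:hasEmotionSet ?emotions ;
         schema:interactionStatistic ?shares .
  ?entity nee:hasMatchedURI ?politician .
  ?emotions onyx:hasEmotion ?negative .
  ?negative onyx:hasEmotionCategory wna:negative-emotion ;
            onyx:hasEmotionIntensity ?intensity .
  ?shares schema:interactionType schema:ShareAction ;
          schema:userInteractionCount ?retweets .
  FILTER (YEAR(?date) = 2016 && ?intensity > $min_negative && ?retweets > $min_retweets)
}
""")


def negative_popular_tweets_query(min_retweets=100, min_negative=0.75):
    """Return the federated query with the two thresholds filled in."""
    return _FEDERATED_TEMPLATE.substitute(
        min_retweets=int(min_retweets), min_negative=float(min_negative)
    )


print(negative_popular_tweets_query())
```

Keeping the thresholds as parameters makes it easy to relax the popularity or sentiment cut-offs when the sample endpoint returns too few results.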
 

contact / provide feedback
Please send your feedback and any comments by email to fafalios@l3s.de.
about

L3S Research Center, University of Hannover, Germany



Contact Person:
Pavlos Fafalios (fafalios@l3s.de, http://l3s.de/~fafalios)
 
This work was partially funded by the European Commission through the ERC Advanced Grant ALEXANDRIA (grant No. 339233) and the H2020 grant No. 687916 (AFEL project).