RUB Corpus
The RUB Corpus is a collection of Russian-language official government speeches, interviews, and press releases made by top policymakers in Russia, Ukraine, and Belarus from 2006 to 2016.
The Corpus is available for download under the “Corpus Download” heading below.
The Corpus is divided by country into plain-text .tsv (tab-separated values) files.
Introduction
The corpus is a collection of 71,515 Russian-language texts published between 1 January 2006 and 31 December 2016.
A text consists of the sentences spoken by an incumbent president, prime minister, or minister of foreign affairs in a single speech, interview, or official press release on a particular date.
The sources for these texts are online, official government archives.
The texts were gathered as part of a PhD dissertation to trace how policymakers’ opinions on particular subjects changed over time.
Tracking what top policymakers say matters because leaders and key officials have a strong influence on policy development in nondemocratic regimes (Ambrosio, 2017, pp. 203–204; Pavlovsky, 2016, pp. 10–16; Rudyj, 2020, p. 198).
The corpus can be used, for example, to analyse the sentiments (positive, negative, neutral, or ambiguous) of Russian-speaking top policymakers on issues such as NATO, democracy, and bilateral relations.
The code used to carry out this sentiment analysis can be found on the Code page.
For examples of how the corpus and code can be used, see the Data Visualisations page.
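As a simpler starting point than sentiment analysis, the sketch below counts how many texts mention a keyword in each year. It is not the project's own code. The sample rows, the ISO-style date format, and the function name `mentions_per_year` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rows in the corpus's (date, href, text) layout; the ISO date
# format and the sample sentences are assumptions made for illustration.
ROWS = [
    ("2008-04-04", "http://example.org/a", "Расширение НАТО обсуждалось на саммите."),
    ("2008-09-01", "http://example.org/b", "Двусторонние отношения развиваются."),
    ("2014-03-18", "http://example.org/c", "Мы говорили о НАТО и о безопасности."),
]


def mentions_per_year(rows, keyword):
    """Count texts per year that contain `keyword` (case-insensitive substring)."""
    needle = keyword.casefold()
    counts = Counter()
    for date, _href, text in rows:
        if needle in text.casefold():
            # Assumes the date string starts with a four-digit year.
            counts[date[:4]] += 1
    return dict(counts)


print(mentions_per_year(ROWS, "НАТО"))
```

Plain substring matching is deliberately crude: searching for "НАТО" this way would also match a word such as "сенатор", so a real analysis would likely tokenise the text first.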
The Collections
The texts were gathered using Python 3.7.1 (Van Rossum & Drake, 2018), BeautifulSoup 4.7.1 (Richardson, 2019), and other third-party Python libraries, such as Pandas (PyData Development Team, 2019) and Requests-HTML (Reitz, 2019), to crawl webpages and scrape the relevant texts.
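The scraping step can be illustrated with a minimal sketch that pulls paragraph text out of an HTML page. To stay dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup, so it is not the project's actual scraper. The class name `ParagraphExtractor` and the sample markup are hypothetical, and real archive pages will be structured differently.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._inside_p = False

    def handle_data(self, data):
        # Text inside nested inline tags (for example <b>) is kept as well.
        if self._inside_p:
            self._buffer.append(data)


# Hypothetical page fragment; real archive pages will have different markup.
SAMPLE_HTML = """
<html><body>
  <h1>Заявление для прессы</h1>
  <p>Первый абзац <b>выступления</b>.</p>
  <p>Второй абзац.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(SAMPLE_HTML)
print(parser.paragraphs)
```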
Russia
14,165 texts were collected for Vladimir Putin, Dmitry Medvedev, and Sergei Lavrov. Sources for Russian policymakers include the President of Russia website (Kremlin, 2020), two Government of Russia websites for prime ministerial texts (Government of Russia, 2020; Office of the Prime Minister, 2012), and the Russian Ministry of Foreign Affairs website (MFA of Russia, 2020).
Ukraine
51,827 texts were collected for 12 policymakers (four presidents and eight prime ministers). The sources for Ukraine’s policymakers are various iterations of its “Government Portal” (Government Portal, 2017), which archived the majority of top officials’ speeches and interviews from the early 2000s to the present. However, when Ukrainian President Petro Poroshenko lost re-election to Volodymyr Zelensky in 2019, the portal’s presidential archive was abruptly taken offline. The internet-archiving site Wayback Machine (Internet Archive, 2020) therefore had to be used to source Ukrainian presidential texts. The still-accessible elements of the Government Portal archive were used for the prime ministerial texts.
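The source does not describe how the Wayback Machine was queried. As one possible approach, the sketch below builds a request for the Wayback Availability JSON API cited in the references and reads the closest snapshot from a response. The page URL, the function names, and the hand-written sample response are hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp):
    """Build a query asking for the snapshot closest to `timestamp` (YYYYMMDD)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def closest_snapshot_url(payload):
    """Return the closest archived snapshot URL, or None if none is available."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hypothetical page URL and a hand-written response in the API's JSON shape.
query_url = build_availability_url("example.org/news/1", "20140601")
sample_response = json.loads(
    '{"archived_snapshots": {"closest": {"available": true, '
    '"url": "http://web.archive.org/web/20140601000000/http://example.org/news/1", '
    '"timestamp": "20140601000000", "status": "200"}}}'
)

print(query_url)
print(closest_snapshot_url(sample_response))
```

Keeping URL construction and response parsing in separate functions means the parsing logic can be checked against saved responses without contacting the archive.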
Belarus
5,523 texts were compiled for nominal Belarusian President Aleksandr Lukashenka only. Online resources for other top Belarusian policymakers were scattered, limited, or non-existent. Sources for Lukashenka, on the other hand, are comprehensive and compiled at the “President of Belarus” website (Presidential Press Service of the Republic of Belarus, 2020).
Corpus Download
To use any of the corpus collections, please use the following citation:
Braga, P. (2020). RUB Corpus and Code. Project repository. Available at: https://github.com/pjbraga/rub_corpus_and_code.
The RUB Corpus collections are currently available as zip files in .tsv format:
Russia | Ukraine | Belarus |
Russian policymakers (Presidents/Prime Ministers Vladimir Putin and Dmitry Medvedev, and Minister of Foreign Affairs Sergei Lavrov) collection, 2006–2016 | Ukrainian policymakers (Presidents Viktor Yushchenko, Viktor Yanukovych, Oleksandr Turchynov, and Petro Poroshenko; Prime Ministers Yuri Yekhanurov, Viktor Yanukovych, Yulia Tymoshenko, Oleksandr Turchynov, Mykola Azarov, Sergei Arbuzov, Arseniy Yatsenyuk, and Volodymyr Groysman) collection, 2006–2016 | Belarusian policymaker (President Aleksandr Lukashenka) collection, 2006–2016 |
Corpus Organisation
Corpus texts are arranged in .tsv files.
The first column gives a text’s publication date, the second column gives the corresponding href (link) for the text, and the third column contains the text itself.
For example:
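The three-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch. The sample rows, the date format, the absence of a header row, and the function name `read_collection` are all assumptions, so check them against the downloaded files.

```python
import csv
import io

# Speeches can be long, so raise the per-field size limit above the default.
csv.field_size_limit(10**7)

# Hypothetical two-row sample in the three-column layout described above.
# The date format, URLs, and the absence of a header row are assumptions.
SAMPLE_TSV = (
    "2006-01-10\thttp://example.org/news/1\tПервый текст.\n"
    "2016-12-30\thttp://example.org/news/2\tВторой текст.\n"
)


def read_collection(handle):
    """Return a list of dicts with 'date', 'href', and 'text' keys."""
    # QUOTE_NONE keeps quotation marks inside the texts as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if len(record) != 3:
            continue  # skip blank or malformed lines
        date, href, text = record
        rows.append({"date": date, "href": href, "text": text})
    return rows


rows = read_collection(io.StringIO(SAMPLE_TSV))
print(len(rows), rows[0]["date"], rows[0]["href"])
```

For a real collection file, the same function can be passed a handle opened with `open(path, encoding="utf-8", newline="")`, assuming the files are UTF-8 encoded.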
Questions or Issues
For issues with the RUB Corpus and Code repositories, please use GitHub.
For general questions about this project or any ideas for academic collaboration, contact Peter Braga at: pjbraga.rubcc@gmail.com.
Page References
Ambrosio, T. (2017). The fall of Yanukovych: Structural and political constraints to implementing authoritarian learning. East European Politics, 33(2), 184–209. Available at: https://doi.org/10.1080/21599165.2017.1304382.
Government of Russia. (2020). Government of Russia: News. Available at: http://government.ru/news/.
Government Portal. (2017). Government Portal [Government website]. Ukrainian Government Portal. Available at: http://old.kmu.gov.ua/kmu/control/publish/.
Internet Archive. (2020). Wayback Availability JSON API. Wayback Machine. Available at: https://archive.org/help/wayback_api.php.
Kremlin. (2020). President of Russia: Transcripts [Government website]. President of Russia. Available at: http://kremlin.ru/events/president/transcripts.
Loukachevitch, N. and Levchik, A. (2016). Creating a General Russian Sentiment Lexicon. In Proceedings of Language Resources and Evaluation Conference LREC-2016. Available at: http://www.lrec-conf.org/proceedings/lrec2016/pdf/285_Paper.pdf.
MFA of Russia. (2020). Ministerial Speeches: Minister of Foreign Affairs of the Russian Federation. Available at: https://www.mid.ru/ru/press_service/minister_speeches/.
Office of the Prime Minister. (2012). Archive of the Official Site of the 2008-2012 Prime Minister of the Russian Federation Vladimir Putin: Events. Available at: http://archive.premier.gov.ru/events/news/.
Pavlovsky, G. (2016). Russian Politics Under Putin: The System Will Outlast the Master. Foreign Affairs, 95(3), 10–17. Available at: https://www.foreignaffairs.com/articles/russia-fsu/2016-04-18/russian-politics-under-putin.
Presidential Press Service of the Republic of Belarus. (2020). Website of the President of Belarus: Archive [Government website]. President of Belarus. Available at: http://president.gov.by/ru/news_ru/archive/page/.
PyData Development Team. (2019, February 3). Pandas Documentation: What’s New in 0.24.1. Available at: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.1.html.
Reitz, K. (2019, February 18). Requests-HTML (version 0.10.0). GitHub. Available at: https://github.com/psf/requests-html.
Richardson, L. (2019, January 7). Beautiful Soup 4.7.1 (HTML parser). Available at: https://www.crummy.com/software/BeautifulSoup/bs4/download/4.7/.
Rudyj, K. V. (2020). “Nepohožie: Vzglâd na Kitaj i belorussko-kitajskie otnošeniâ” [Dissimilar: A Perspective on China-Belarusian Relations]. Zviazda: Minsk, Belarus. Available at: https://oz.by/books/more10931323.html.
Van Rossum, G., & Drake, F. L. (2018, October 20). Python 3.7.1. Available at: https://www.python.org/downloads/release/python-371/.