RUB Corpus and Code

RUB Corpus and Code are two downloadable, open-source collections.

The RUB Corpus is a collection of Russian-language official government speeches, interviews, and press releases made by top policymakers in Russia, Ukraine, and Belarus from 2006 to 2016.

The Code consists of the programs used to compile the RUB Corpus and to conduct a lexicon-based sentiment analysis of it.

The sentiment analysis was conducted using a modified version of the lexicon created by Loukachevitch and Levchik (2016).
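
For readers unfamiliar with the approach, the following is a minimal sketch of how a lexicon-based sentiment score can be computed in general. It is not the project's own code. The lexicon entries, the names `TOY_LEXICON`, `tokenize`, and `sentiment_score`, and the scoring rule (sum of word scores divided by token count) are all assumptions made for illustration; none of the entries are taken from the Loukachevitch and Levchik (2016) lexicon.

```python
import re

# Hypothetical toy lexicon mapping a word to a polarity score.
# Illustrative only: these entries are NOT from the published lexicon.
TOY_LEXICON = {
    "хороший": 1,   # "good"
    "успех": 1,     # "success"
    "плохой": -1,   # "bad"
    "кризис": -1,   # "crisis"
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all tokens and divide by the token count.

    Words missing from the lexicon count as neutral (0).
    Empty text returns 0.0.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = sum(lexicon.get(token, 0) for token in tokens)
    return total / len(tokens)


# One positive word out of three tokens gives a mildly positive score.
print(sentiment_score("Это большой успех"))
```

This sketch matches exact word forms only. Russian is heavily inflected, so a practical analysis would probably need to lemmatise each token before looking it up in the lexicon.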

Both the RUB Corpus and Code can be downloaded from their respective website pages (see the Corpus and Code links on the navigation bar), or they can be downloaded from this website’s GitHub repository.

About this Website

The site provides access to the RUB Corpus and Code collections and explains how they can be used.

Some examples of the sentiment analysis results produced by the code and corpus can be found on the Data Visualisation page.

Any changes to this website or the data linked to it will be listed on the News page of this site.

About the RUB Corpus and Code Project

The RUB Corpus and Code collections are an academic project.

This project was created by Peter Braga as part of his doctoral research at the UCL School of Slavonic and East European Studies.

Citation for this Project

To cite any part of this website (such as where the RUB Corpus and Code repositories can be found), please use the following citation:

Braga, P. (2020). RUB Corpus and Code. Project repository. Available at: https://github.com/pjbraga/rub_corpus_and_code.

If you use the RUB Corpus and Code repositories (for a project or any academic research), please use the same citation.

Expanding the RUB Project

The next aim of the project is to expand the RUB Corpus to include presidential and prime ministerial texts up to 2021.

In addition, where possible, the list of politicians and forums being sourced will also be expanded. For example, notable members of the Russian Duma and Ukrainian Rada could be added to the RUB collection.

Another important project connected to the RUB Corpus and Code is the building of Russian-language sentiment analysis training datasets.

Please use the contact details below if you are interested in helping to improve or expand this project.

Contact

For issues with the RUB Corpus and Code repositories, please use the project's GitHub repository.

For general questions about this project or any ideas for academic collaboration, feel free to contact Peter Braga at: pjbraga.rubcc@gmail.com.

Page References

Loukachevitch, N. and Levchik, A. (2016). Creating a General Russian Sentiment Lexicon. In Proceedings of Language Resources and Evaluation Conference LREC-2016. Available at: http://www.lrec-conf.org/proceedings/lrec2016/pdf/285_Paper.pdf.