
The UniWay dataset simulates the interaction between students and the university’s back-office staff. The data consists of questions asked by graduate students (including Erasmus students) of the Athens University of Economics and Business (AUEB)1 and the Aristotle University of Thessaloniki (AUTH)2, collected through a fully anonymous, GDPR-compliant process. Students were asked to provide each question in both English and Greek whenever possible.

Based on the information provided by the staff working at the back offices of the two universities, we distinguished six different types of questions, which correspond to six different student intents as shown in Table 1. The same table contains the distribution of questions per intent in each of the two UniWay datasets (i.e. EN and GR). The English version of the dataset is slightly larger because some of the students who participated did not speak Greek.

| Question type | Intent | Greek | English |
|---|---|---|---|
| What are the available courses for the following semester? | I1 AvailableCourses | 365 | 523 |
| What is a course’s weight in the final grade? | I2 CourseCredit | 361 | 380 |
| What courses have I passed in this exam period? | I3 PassedCourses | 362 | 360 |
| Get my grade in a course | I4 GradeCourse | 362 | 366 |
| Get the contact info of a tutor | I5 TutorInfo | 364 | 362 |
| Get the tutor of a course | I6 TutorCourse | 362 | 346 |
| Total | | 2176 | 2337 |

Table 1: The distribution of intent labels in the Greek and English versions of the UniWay dataset
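
The totals and per-intent shares in Table 1 can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table, while the `COUNTS` dictionary and the helper names `totals` and `shares` are illustrative and not part of the dataset files.

```python
# Question counts per intent, copied from Table 1 as (Greek, English).
COUNTS = {
    "I1 AvailableCourses": (365, 523),
    "I2 CourseCredit": (361, 380),
    "I3 PassedCourses": (362, 360),
    "I4 GradeCourse": (362, 366),
    "I5 TutorInfo": (364, 362),
    "I6 TutorCourse": (362, 346),
}


def totals(counts):
    """Return the (Greek, English) column totals."""
    greek = sum(gr for gr, _ in counts.values())
    english = sum(en for _, en in counts.values())
    return greek, english


def shares(counts, column):
    """Return each intent's fraction of one column (0 = Greek, 1 = English)."""
    total = totals(counts)[column]
    return {intent: row[column] / total for intent, row in counts.items()}


if __name__ == "__main__":
    print(totals(COUNTS))  # (2176, 2337)
    for intent, share in shares(COUNTS, 1).items():
        print(f"{intent}: {share:.1%} of the English questions")
```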

The intent of each question has been manually defined by a team of experts. To get an indication of the difficulty of the IE task, we asked two annotators to examine the same subset of English and Greek queries (120 queries each) and decide on the intent of each query. The inter-annotator agreement on this subset was 90% for the English dataset and 85% for the Greek one. The disagreements arose mainly because some sentences were unclear or were missing key entities or terms that would help determine their intent, and could therefore be interpreted differently. These ambiguous sentences mostly fell under the intents I1 (AvailableCourses) and I3 (PassedCourses). Nevertheless, a successful conversational agent should be able to cope with unclear sentences in a realistic way, so we decided not to include an UnknownIntent class in our dataset; such cases were resolved by asking the reviewers to reach a consensus.
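
As a minimal sketch of how such an agreement figure can be computed, the snippet below measures simple percent agreement between two annotators. It assumes the reported values are raw percent agreement (the text does not say which measure was used); the function name and the sample labels are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same intent."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one labelled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical decisions of two annotators on four queries.
annotator_1 = ["I1", "I3", "I2", "I4"]
annotator_2 = ["I1", "I1", "I2", "I4"]
print(percent_agreement(annotator_1, annotator_2))  # 0.75
```

Under that assumption, 90% agreement on 120 queries corresponds to 108 matching decisions, and 85% to 102.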

The questions were then converted to lower case, stripped of punctuation and, for the Greek utterances, of accents, and annotated for entities. Each word token that is part of a named entity is matched with an entity from a predefined list, using IOB tags (e.g., begin, inside) and labels (e.g., tutor name); every other token is tagged O (the “null” tag), as shown in the examples of Figure 1.
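
The preprocessing and tagging steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used to build the dataset: punctuation is simply deleted, accents are removed via Unicode decomposition, labels follow the "entity name b/i" form seen in Table 2, and the function names and sample sentences are hypothetical.

```python
import unicodedata


def normalize(text):
    """Lower-case the text and drop accents and punctuation."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn":          # combining mark, e.g. a Greek accent
            continue
        if category.startswith("P"):  # any punctuation character
            continue
        kept.append(ch)
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join("".join(kept).split())


def iob_tags(tokens, entities):
    """Tag tokens given (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"{label} b"    # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"{label} i"    # remaining tokens of the entity
    return tags


# Hypothetical examples, not taken from the dataset.
print(normalize("Ποιος είναι ο καθηγητής;"))  # ποιος ειναι ο καθηγητης
tokens = normalize("Who teaches Machine Learning?").split()
print(iob_tags(tokens, [(2, 4, "course name")]))
# ['O', 'O', 'course name b', 'course name i']
```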

Figure 1: Sample questions with entities and intents from the UniWay EN dataset

The distribution of the six entity labels in each dataset is shown in Table 2. Intents are almost equally distributed, whereas entity labels are not. The label teacher name i appears very rarely because teachers are seldom searched for by their first name.

| Entity label | Greek | English |
|---|---|---|
| exam period b | 409 | 479 |
| exam period i | 488 | 532 |
| teacher name b | 322 | 308 |
| teacher name i | 21 | 36 |
| course name b | 1032 | 969 |
| course name i | 2030 | 1757 |

Table 2: The distribution of entity type labels in the Greek and English versions of the UniWay dataset

1 https://www.dept.aueb.gr/cs
2 https://www.csd.auth.gr/en

Citations

If you use our dataset in your research or find our repository useful, please cite our work:

S. Rizou, A. Theofilatos, A. Paflioti, E. Pissari, I. Varlamis, G. Sarigiannidis, K.Ch. Chatzisavvas,
Efficient intent classification and entity recognition for university administrative services employing deep learning models,
Intelligent Systems with Applications,
Volume 19,
2023,
200247,
ISSN 2667-3053,
https://doi.org/10.1016/j.iswa.2023.200247

License

The UniWay dataset is available under the Creative Commons BY-NC-SA 4.0 license.

Download

Download the files of UniWay EN and UniWay GR here