UniWay

The UniWay dataset simulates the interaction between students and the university’s back-office staff. The data consists of questions asked by graduate students (including Erasmus students) of the Athens University of Economics and Business (AUEB)¹ and the Aristotle University of Thessaloniki (AUTH)², collected through a fully anonymous and GDPR-compliant process. Students were asked to provide two versions of each question, one in English and one in Greek, whenever possible.

Based on the information provided by the staff working at the back offices of the two universities, we distinguished six different types of questions, which correspond to six different student intents as shown in Table 1. The same table contains the distribution of questions per intent in each of the two UniWay datasets (i.e. EN and GR). The English version of the dataset is slightly larger because some of the students who participated did not speak Greek.

Question Type	Intent	Greek	English
What are the available courses for the following semester?	I₁ AvailableCourses	365	523
What is a course’s weight in the final grade?	I₂ CourseCredit	361	380
Which courses have I passed in this exam period?	I₃ PassedCourses	362	360
Get my grade in a course	I₄ GradeCourse	362	366
Get the contact info of a tutor	I₅ TutorInfo	364	362
Get the tutor of a course	I₆ TutorCourse	362	346
Total		2176	2337

Table 1: The distribution of intent labels in the Greek and English versions of the UniWay dataset
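
As a quick sanity check on Table 1, the short Python sketch below recomputes the column totals and the share of each intent. The counts are copied from the table; the dictionary layout and the function names are illustrative assumptions and do not reflect the format of the released files.

```python
# Counts copied from Table 1 as (Greek, English) pairs per intent.
# The data structure is an illustrative assumption, not the dataset's file format.
INTENT_COUNTS = {
    "AvailableCourses": (365, 523),
    "CourseCredit": (361, 380),
    "PassedCourses": (362, 360),
    "GradeCourse": (362, 366),
    "TutorInfo": (364, 362),
    "TutorCourse": (362, 346),
}


def totals(counts):
    """Return the (Greek, English) column totals."""
    greek = sum(g for g, _ in counts.values())
    english = sum(e for _, e in counts.values())
    return greek, english


def shares(counts, column):
    """Return each intent's fraction of one column (0 = Greek, 1 = English)."""
    total = sum(row[column] for row in counts.values())
    return {intent: row[column] / total for intent, row in counts.items()}


greek_total, english_total = totals(INTENT_COUNTS)
print(greek_total, english_total)  # 2176 2337

for intent, share in shares(INTENT_COUNTS, column=1).items():
    print(f"{intent}: {share:.1%}")
```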

The intent of each question was manually assigned by a team of experts. To get an indication of the difficulty of the IE task, we asked two annotators to examine the same subset of English and Greek queries (120 queries each) and decide on the intent of each query. The inter-annotator agreement on this subset was 90% for the English dataset and 85% for the Greek one. The disagreements arose mainly because some sentences were unclear or were missing key entities or terms that would help determine their intent, and thus could be interpreted differently. These ambiguous sentences mostly fell under the intents I₁ (AvailableCourses) and I₃ (PassedCourses). Nevertheless, a successful conversational agent should be able to handle unclear sentences in the most realistic way, so we decided not to include an UnknownIntent class in our dataset; such cases were resolved by asking the reviewers to reach a consensus.
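
The text does not say which agreement measure was used. Assuming the reported figures are simple percent agreement (the fraction of queries on which both annotators chose the same intent), the computation can be sketched in Python as follows. The function name and the label lists are hypothetical and serve only as an illustration.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one labelled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for a 120-query subset: the annotators disagree on 12 queries.
annotator_a = ["AvailableCourses"] * 120
annotator_b = ["AvailableCourses"] * 108 + ["PassedCourses"] * 12

print(f"{percent_agreement(annotator_a, annotator_b):.0%}")  # 90%
```

Under this assumption, 90% agreement on 120 queries corresponds to 108 matching decisions, and 85% corresponds to 102.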

The questions were then converted to lower case, stripped of punctuation and (for the Greek utterances) accents, and annotated for entities. Each word token that describes a named entity is matched with an entity from a predefined list, using IOB tags (e.g., begin, inside) and labels (e.g., tutor name); any other token is matched with O (the “null” tag), as shown in the examples of Figure 1.
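
The preprocessing and tagging steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the function names are hypothetical, accents are removed through Unicode decomposition, and the tag strings follow the label names of Table 2 (for example "course name b"), which may differ from the exact serialization in the released files.

```python
import unicodedata


def normalize(text):
    """Lowercase the text, strip accents, and remove punctuation."""
    text = text.lower()
    # Decompose characters so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop every character whose Unicode category is a punctuation class (P*).
    no_punct = "".join(
        ch for ch in no_accents if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any repeated whitespace left behind.
    return " ".join(no_punct.split())


def iob_tags(tokens, spans):
    """Tag tokens given entity spans as (start, end_exclusive, label) triples.

    The first token of a span gets "<label> b", the following tokens get
    "<label> i", and every token outside a span gets "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"{label} b"
        for i in range(start + 1, end):
            tags[i] = f"{label} i"
    return tags


tokens = normalize("Who teaches Machine Learning?").split()
print(tokens)  # ['who', 'teaches', 'machine', 'learning']
print(iob_tags(tokens, [(2, 4, "course name")]))
# ['O', 'O', 'course name b', 'course name i']
```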

Figure 1: Sample questions with entities and intents from the UniWay EN dataset

The distribution of the six entity labels in each dataset is shown in Table 2. We can observe that intents are almost equally distributed, in contrast to entities. The teacher name i label appears only a few times, since teachers are rarely searched for by their first name.

Entities	Greek	English
exam period b	409	479
exam period i	488	532
teacher name b	322	308
teacher name i	21	36
course name b	1032	969
course name i	2030	1757

Table 2: The distribution of entity type labels in the Greek and English versions of the UniWay dataset
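
Assuming every entity mention starts with exactly one "b" token (the IOB2 convention, which the counts above suggest but the text does not state explicitly), the "b" counts in Table 2 equal the number of mentions, and the average mention length in tokens is (b + i) / b. The Python sketch below derives these values; the dictionary layout and the function name are illustrative assumptions.

```python
# Token counts copied from Table 2 as (b, i) pairs per entity type.
# Assumption: each mention begins with exactly one "b" token (IOB2 style).
ENTITY_COUNTS = {
    "Greek": {
        "exam period": (409, 488),
        "teacher name": (322, 21),
        "course name": (1032, 2030),
    },
    "English": {
        "exam period": (479, 532),
        "teacher name": (308, 36),
        "course name": (969, 1757),
    },
}


def mean_span_length(b_count, i_count):
    """Average number of tokens per entity mention."""
    return (b_count + i_count) / b_count


for language, entities in ENTITY_COUNTS.items():
    for entity, (b_count, i_count) in entities.items():
        length = mean_span_length(b_count, i_count)
        print(f"{language} {entity}: {b_count} mentions, {length:.2f} tokens each")
```

Under this reading, course names are the longest mentions on average, while teacher names are almost always a single token.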

¹https://www.dept.aueb.gr/cs
²https://www.csd.auth.gr/en

Citations

If you use our dataset in your research or find our repository useful, please cite our work:

S. Rizou, A. Theofilatos, A. Paflioti, E. Pissari, I. Varlamis, G. Sarigiannidis, K.Ch. Chatzisavvas,
Efficient intent classification and entity recognition for university administrative services employing deep learning models,
Intelligent Systems with Applications,
Volume 19,
2023,
200247,
ISSN 2667-3053,
https://doi.org/10.1016/j.iswa.2023.200247

License

The UniWay dataset is available under the Creative Commons BY-NC-SA 4.0 license.

Download

Download the files of UNIWAY EN and UNIWAY GR here