Classification of fictional What-if ideas

Martin Žnidaršič, Jasmina Smailović
Department of Knowledge Technologies
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
{martin.znidarsic, jasmina.smailovic}@ijs.si

ABSTRACT
The paper presents the process and the outcomes of machine learning a human evaluation model of textually represented fictional ideas. The ideas that we target are generated artificially by various computational creativity systems. The study presented in this paper is one of the first to employ machine learning in the challenging domain of evaluating artificially generated creative artefacts. Our results confirm the validity of the approach and indicate differences in the importance of the features used.

1. INTRODUCTION
The work presented in this paper addresses the problem of machine learning the differences between good and bad fictional What-if ideas that are represented as human-readable text. Fictional What-if ideas are ideas such as: "What if there was a city built in the air?" They start with a "What if" proposition and describe a fictional situation, i.e., a situation that is not realistic or is at least not commonly considered plausible. Ideas of this kind are used in practice in art, literature, the entertainment industry (e.g., movie plots) and advertising. In these domains, ideas are the main driving force of creative work, and their production, also of the What-if kind, is an essential part of the conceptualization phase of product development.

The What-if ideas that we focus on are not only fictional, but also computer generated. Automated production of such creative artefacts belongs to the domain of computational creativity, a subfield of artificial intelligence that is concerned with understanding, simulating and recreating creative behaviours [1, 7, 8]. Although the production of fictional ideas is deemed creative work and as such is commonly assumed to elude computational solutions, automatic What-if idea generators are now being developed [5, 9], mostly in the scope of the WHIM project (www.whim-project.eu). The idea generators can be parameterized in various ways to generate large numbers of What-if ideas of various flavours. However, most of these automatically generated ideas are noisy and of low quality, particularly those from the generators that produce less restricted outputs, which, despite a larger proportion of noise, usually produce the more interesting and valuable results. Automated evaluation of such results is very difficult, but it would also be very beneficial if it could be made possible. Our aim is to create such an evaluation system by means of machine learning. More specifically, based on human-labelled data we intend to create (components of) a human evaluation model of fictional What-if ideas. There are two questions that we aim to answer with our work: (I) can an evaluation model for What-ifs be constructed automatically, and (II) which features of the What-ifs are the most relevant in this respect?

Evaluation is difficult and controversial in the context of computational creativity systems [6, 4]. Namely, for the outputs of these systems there is often no common measure of appropriateness or value. We intend to create a general evaluation model, which might suppress some subjective views and should reflect the general population as much as possible. For this reason, we are using a crowdsourcing approach through an open online evaluation platform and crowdsourced questionnaires.
Results of our experiments indicate that, despite the challenging nature of the problem, we can address it with machine learning methods. Namely, despite relatively low classification performance, the evaluation models improve upon the baseline, which is not an obvious result in such a challenging domain. Regarding feature importance, the outcomes are not straightforward. The benefits of using the words from the What-ifs are evident, while the other features have a much less pronounced impact on performance.

2. DATA
The aim of our data gathering is the acquisition of human evaluations of the computer-generated What-if ideas. This data serves as the basis for the development of the audience evaluation model. We are using two kinds of opinion sources: (I) the assessments and opinions of anonymous visitors of our open online platform (http://www.whim-project.eu/whatifmachine/) and (II) the results from crowdsourced questionnaires with a specific experimental target.

Table 1: Label distribution for each generator.

Labels in the open platform
Gen.    1      2      3      4      5      Sum
Alt     603    337    277    211    176    1,604
Dis     1,043  647    680    776    551    3,697
Kaf     315    188    168    189    154    1,014
Met     111    143    204    267    180    905
Mus     30     33     16     37     21     137
Uto     563    448    434    466    318    2,229
all     2,665  1,796  1,779  1,946  1,400  9,586

Labels in the questionnaires
Gen.    1      2      3      4      5      Sum
Alt     929    933    1,239  891    403    4,395
Dis     861    952    1,349  861    411    4,434
Kaf     923    880    1,239  920    454    4,416
Met     657    855    1,398  1,032  497    4,439
Uto     769    824    1,336  930    519    4,378
Test    1,527  745    935    632    413    4,252
all     5,666  5,189  7,496  5,266  2,697  26,314

We denote these opinion sources as open and targeted, respectively. The first is intended for gathering simple casual opinions from the general public, while the aim of the second is to gather data for specific experiments. The assessment procedure in the online platform is tailored to the online context and favors simplicity and clarity of the interface over thoroughness of the assessments. The questionnaires, on the other hand, are more controlled, as we can decide to accept only fully completed questionnaires, we can ask for demographic data and we can evaluate more complex concepts (such as novelty, narrative potential, etc.), since the paid evaluators are expected to devote more effort to the task.

2.1 Datasets
Two datasets were prepared for the work presented in this paper, as described in the following.

2.1.1 Open: from the open online platform
The first dataset is a collection of evaluated What-if sentences obtained using the open online platform. What-ifs on this platform were created by 6 different generators: Alternative Scenarios, Disney, Kafkaesque, Metaphors, Musicals, and Utopias and Distopias. The anonymous visitors rated the What-ifs of a selected generator on a 5-point Likert scale from 1 to 5. They could also report a sentence as offensive/incorrect or comment on it. Using the online platform we acquired 9,586 labels for 5,908 different What-if sentences. The label distributions for each generator are shown in Table 1. Since some What-if sentences were labeled multiple times by different people, we merged the multiple labels of a sentence into one label by calculating their median value. We considered the What-ifs with labels of 2 or less to belong to the negative class, those with labels of 4 or more to the positive class, and the others to the neutral class. The final label/generator distribution is shown in Figure 1. In the experiments and evaluation, we used only positively and negatively scored What-if sentences. There are 1,881 of the former and 2,600 of the latter. The majority class represents 58.02% of this dataset.
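To make the merging step concrete, the following minimal Python sketch (not the code used in the study) merges repeated ratings of a What-if by their median and maps the result onto the three classes described above.

```python
# Minimal sketch: merge repeated 1-5 ratings per What-if by their median and
# map the median onto negative/neutral/positive, as described in Section 2.1.1.
from statistics import median
from collections import defaultdict

def merge_and_label(ratings):
    """ratings: list of (what_if_text, score_1_to_5) pairs, possibly repeated."""
    by_text = defaultdict(list)
    for text, score in ratings:
        by_text[text].append(score)

    labelled = {}
    for text, scores in by_text.items():
        m = median(scores)
        if m <= 2:
            labelled[text] = "negative"
        elif m >= 4:
            labelled[text] = "positive"
        else:
            labelled[text] = "neutral"
    return labelled

# Example: three people rated the same What-if with 1, 2 and 4 -> median 2 -> negative.
example = [("What if there was a city built in the air?", 1),
           ("What if there was a city built in the air?", 2),
           ("What if there was a city built in the air?", 4)]
print(merge_and_label(example))
```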
Figure 1: Label distribution for the open dataset.

Figure 2: Label distribution for the targeted dataset.

2.1.2 Targeted: from crowdsourced questionnaires
The second dataset consists of a collection of What-if sentences labelled in crowdsourced questionnaires (we used the CrowdFlower platform, http://www.crowdflower.com/). The What-ifs were created by 5 different generators: Alternative Scenarios, Disney, Kafkaesque, Metaphors, and Utopias and Distopias. There were also two manually constructed What-ifs for testing purposes (denoted as Test). The annotators labelled the What-ifs according to several criteria; however, in this study we analyze only the scores for overall impression. The annotators rated the What-ifs on a 5-point Likert scale from 1 to 5 to express their overall impression of them. We acquired 26,314 labels for 2,002 What-if sentences. The label distributions for each generator are given in Table 1. As in the open dataset, we merged multiple labels into one (median) label, which was mapped into a negative/neutral/positive class. The resulting label distribution is shown in Figure 2. The final targeted dataset contains all the unique positively scored (190) and negatively scored (414) examples, with the majority class covering 68.54% of the items.

2.2 Data features
The data items that we collect consist of a textual representation of a What-if and an assigned label from the [1..5] interval. For use in our experiments, we constructed a number of features to better represent each data item for machine learning purposes. Merging of the labels was simple, as we used the median value of the assessments in cases where there was more than one for a given item. Definition of the feature space was more complex and is described in the following, separately for the purely textual features and for some additional ones.

2.2.1 n-grams
In the process of constructing textual features (performed using the LATINO toolbox, http://source.ijs.si/mgrcar/latino) we first applied tokenization and stemming to the What-if sentences. We then constructed n-grams of length 1 (i.e., unigrams) and 2 (i.e., bigrams), and discarded those whose number of occurrences in the dataset was lower than 2. Finally, we created feature vectors by employing the term frequency feature representation. However, as our initial experiments showed only a minor advantage of bigrams compared to unigrams (accuracy of 62.47% compared to 62.45% on the open dataset, and 74.95% compared to 74.92% on the targeted dataset), we used only the latter, which allowed for much faster experimental work.

2.2.2 Additional features
In addition to the n-grams, we used the following additional features of each What-if:
- length: the length of a What-if in terms of the number of characters;
- ambiguity: an assessment of ambiguity according to the number of different meanings of a term in WordNet (https://wordnet.princeton.edu/). It corresponds to the average number of WordNet senses of the known content words in the text;
- rhyming: denotes whether there is a pair of words in the sentence that rhyme. It corresponds to the number of word combinations in the What-if that rhyme, given the rhyming level (the number of ending syllables that have to match; we used 2 in our experiments);
- number of adjectives: corresponds to the number of adjectives that appear in a What-if; and
- number of verbs: corresponds to the number of verbs in a What-if.
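As an illustration of the n-gram features from Section 2.2.1, the sketch below builds stemmed unigram term-frequency vectors with NLTK and scikit-learn; the paper itself used the LATINO toolbox, so this is only an approximation of the described procedure (for instance, min_df=2 stands in for the occurrence cut-off).

```python
# Rough sketch of the n-gram features from Section 2.2.1, using NLTK and
# scikit-learn instead of the LATINO toolbox that was actually used.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    # Tokenize the What-if and stem every token.
    return [stemmer.stem(tok) for tok in word_tokenize(text)]

# Unigram term-frequency vectors (the final experiments used unigrams only);
# min_df=2 drops n-grams occurring fewer than twice.
vectorizer = CountVectorizer(tokenizer=tokenize_and_stem,
                             ngram_range=(1, 1), min_df=2)

what_ifs = ["What if there was a city built in the air?",
            "What if there was a little dog who could not bark?",
            "What if a city could dream?"]
X = vectorizer.fit_transform(what_ifs)      # sparse document-term matrix
print(vectorizer.get_feature_names_out())
```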
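The additional features of Section 2.2.2 could be approximated as follows; note that the WordNet sense count for ambiguity and, in particular, the suffix-based rhyme check are simplifications of the definitions given above, introduced here only for illustration.

```python
# Simplified sketch of the additional features from Section 2.2.2. The WordNet
# sense count stands in for the ambiguity score and a crude word-ending
# comparison replaces the syllable-based rhyming check used in the paper.
from itertools import combinations
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet

def additional_features(text, suffix_len=3):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    words = [w.lower() for w in tokens if w.isalpha()]

    # length: number of characters in the What-if
    length = len(text)

    # ambiguity: average number of WordNet senses over words known to WordNet
    sense_counts = [len(wordnet.synsets(w)) for w in words]
    known = [c for c in sense_counts if c > 0]
    ambiguity = sum(known) / len(known) if known else 0.0

    # rhyming: number of word pairs with a matching ending (a suffix heuristic,
    # not the two-syllable match described above)
    rhyming = sum(1 for a, b in combinations(sorted(set(words)), 2)
                  if len(a) >= suffix_len and len(b) >= suffix_len
                  and a[-suffix_len:] == b[-suffix_len:])

    # numbers of adjectives and verbs, from part-of-speech tags
    num_adjectives = sum(1 for _, tag in tagged if tag.startswith("JJ"))
    num_verbs = sum(1 for _, tag in tagged if tag.startswith("VB"))

    return {"length": length, "ambiguity": ambiguity, "rhyming": rhyming,
            "num_adjectives": num_adjectives, "num_verbs": num_verbs}

print(additional_features("What if there was a city built in the air?"))
```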
A feature that we currently also have access to is the generator, i.e., the algorithm that generated a specific What-if idea. The positive and negative What-ifs are similarly distributed with respect to the generators (as shown in Figures 1 and 2), but some distinctive differences can be observed and these could contribute to classification performance. However, as the generators are constantly being updated and as we want our evaluation system to be general and not limited to a set of specific generators, we did not use these features in our datasets.

3. EXPERIMENTS AND EVALUATION
In this section we present the methodology of the experimental work and the results of the evaluation on the open and targeted datasets.

3.1 Methodology
For classification we employed an SVM classifier (we used the wrapper around the SVMLight [3] classifier in the LATINO library). Evaluation was performed using 10-fold cross-validation. The experiments were performed 10 times, each time with different examples in the folds. We experimented with different feature sets: (I) only n-grams (words), (II) only 3-class additional features, where the borders between the three classes were determined by discretization of the sorted feature data into 3 equal-frequency partitions (oth(3c)), (III) both n-grams and 3-class additional features (w+oth(3c)), (IV) only 2-class additional features, where the border between the two classes was the median feature value (oth(2c)), and (V) both n-grams and 2-class additional features (w+oth(2c)). The evaluation results and the average ranks of the feature sets are shown in Tables 2 and 3.

Table 2: Experimental comparison of classifiers learned on the open dataset with various feature sets. Results of ten 10-fold CV experiments with different folds are shown, followed by the average over all ten experiments and, in the last row, the average ranks (according to the ranking in each experiment).

Run        words     oth(3c)   w+oth(3c)  oth(2c)   w+oth(2c)
1          62.57%    60.54%    62.33%     60.72%    62.35%
2          62.31%    59.99%    62.58%     59.83%    62.64%
3          62.53%    60.59%    62.58%     60.03%    62.58%
4          62.49%    59.87%    62.33%     60.61%    62.35%
5          62.40%    59.43%    62.46%     59.92%    62.46%
6          62.51%    59.65%    62.33%     60.72%    62.35%
7          62.55%    60.46%    62.51%     60.72%    62.53%
8          62.33%    60.37%    62.40%     60.68%    62.49%
9          62.44%    60.57%    62.46%     60.72%    62.51%
10         62.40%    60.57%    62.62%     60.68%    62.58%
Avg.       62.45%    60.20%    62.46%     60.46%    62.48%
Avg. rank  2.2       4.8       2.2        4.2       1.6

Table 3: Experimental comparison of classifiers learned on the targeted dataset with various feature sets. Results of ten 10-fold CV experiments with different folds are shown, followed by the average over all ten experiments and, in the last row, the average ranks (according to the ranking in each experiment).

Run        words     oth(3c)   w+oth(3c)  oth(2c)   w+oth(2c)
1          75.22%    68.89%    74.57%     67.42%    74.92%
2          74.85%    68.75%    73.19%     67.41%    73.36%
3          74.84%    68.86%    73.68%     67.88%    73.83%
4          74.84%    68.09%    74.69%     66.43%    74.69%
5          75.83%    68.72%    74.85%     66.54%    74.83%
6          74.65%    68.97%    74.81%     67.19%    74.98%
7          74.32%    67.36%    74.81%     65.89%    74.32%
8          75.30%    68.85%    74.30%     67.55%    73.80%
9          74.17%    69.43%    73.34%     68.07%    73.34%
10         75.16%    67.86%    74.82%     66.03%    75.50%
Avg.       74.92%    68.58%    74.31%     67.04%    74.36%
Avg. rank  1.45      4         2.4        5         2.15
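The two discretization schemes used for the additional features, oth(2c) and oth(3c), can be sketched as follows; this is an illustrative reimplementation, not the original preprocessing code.

```python
# Sketch of the two discretization schemes applied to the additional features:
# oth(2c) splits a feature at its median, oth(3c) uses three equal-frequency bins.
import numpy as np

def discretise_2c(values):
    # 0 = at or below the median, 1 = above it
    med = np.median(values)
    return (np.asarray(values) > med).astype(int)

def discretise_3c(values):
    # equal-frequency binning via the 33rd and 67th percentiles
    v = np.asarray(values, dtype=float)
    low, high = np.percentile(v, [100 / 3, 200 / 3])
    return np.digitize(v, [low, high])      # class 0, 1 or 2

lengths = [42, 55, 61, 38, 70, 49, 58, 66, 44, 52]   # e.g. What-if lengths
print(discretise_2c(lengths))
print(discretise_3c(lengths))
```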
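Likewise, the evaluation protocol of ten repetitions of 10-fold cross-validation with an SVM, compared against the majority-class baseline, could look roughly like the sketch below in scikit-learn; the paper used SVMLight through the LATINO library, so the classifier and its settings here are assumptions made for illustration.

```python
# Sketch of the evaluation protocol: ten repetitions of 10-fold cross-validation
# with a linear SVM, compared to the majority-class baseline.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def repeated_cv_accuracy(X, y, repetitions=10, folds=10):
    # X could be, e.g., the document-term matrix from the n-gram sketch above,
    # and y the 0/1 labels for negatively/positively scored What-ifs.
    accuracies = []
    for rep in range(repetitions):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=rep)
        scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="accuracy")
        accuracies.append(scores.mean())
    return np.array(accuracies)

def majority_baseline_accuracy(X, y):
    # Accuracy of always predicting the majority class, estimated with 10-fold CV.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    clf = DummyClassifier(strategy="most_frequent")
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
```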
3.2 Discussion of results
According to the results in Tables 2 and 3, the machine-learned classifiers were able to distinguish (to a limited extent, of course) between the What-ifs that are generally considered good and the ones that are considered bad. Namely, all the classifiers that consider all the available features were able to beat the baseline of the given classification problem.

Table 4: The features that are relatively (in one class, compared to the other) the most represented in positively and negatively scored What-ifs for the open and the targeted dataset.

Open dataset:
positive       diff    negative      diff
"used"         204     "there"       -826
"were"         176     "was"         -723
"become"       159     "if"          -719
"and"          99      "what"        -719
"their"        89      "?"           -719
"by"           34      "a"           -693
"embrace"      27      numvb         -642
"beautiful"    24      "who"         -624
"stars"        23      ambiguity     -511
"engineer"     22      rhyming       -471

Targeted dataset:
positive       diff    negative      diff
"not"          31      "what"        -224
"all"          31      "if"          -224
"its"          30      "?"           -224
"lies"         16      "a"           -222
"used"         14      "there"       -211
"invent"       14      "was"         -209
"stories"      11      numvb         -195
"tell"         9       "who"         -179
"talk"         8       numadj        -176
"inherit"      7       rhyming       -172

Among the classifiers, there is a notable difference in the performance of those that use the words of the What-ifs as features and those that do not, as the former (denoted words, w+oth(3c) and w+oth(2c)) consistently yield higher accuracies. Statistical analysis with the suggested [2] Friedman test indicates that the differences among these classifiers are not significant. It has to be noted, though, that the test is very conservative, particularly in a situation with only two independent datasets. As there is no statistical test available for multiple classifiers on a single dataset, and as repeated 10-fold CVs on the same dataset cannot be strictly regarded as independent, we report the average ranks of the experiments for all the classifiers. These ranks are nevertheless considered [2] a fair and useful method of comparison.

Using the words as features is clearly beneficial, while the usefulness of the other features of the What-ifs is not so evident. In the open dataset it seems that the use of additional features is beneficial, as the average accuracies are higher, but this is not the case in the targeted dataset. However, in some non-reported experiments in which we set the discretization borders manually in an arbitrary fashion (the reason for omitting them from Tables 2 and 3), the best results were obtained with such a combined feature set, e.g., 62.51% accuracy with a somewhat worse rank than the w+oth(2c) for the open dataset, and 74.95% with the best rank for the targeted one. Although this does not affect the significance of the results, it indicates the importance of feature preprocessing (the discretization method in this case).

Another indication that the additional features have some impact is given in Table 4, where we present the top ten features that are relatively the most represented in either one or the other class for each dataset. The entries in quotes are the n-grams (words), while the remaining entries refer to high or low values of the corresponding additional feature. The appearance of additional features among the most representative ones indicates that some of them do have a notable relation to the general human perception of What-if texts.
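The rank-based comparison described above can be illustrated with a short sketch that computes average ranks per feature set over repeated experiments and applies the Friedman test; the accuracy values in the example are made up for illustration and are not taken from Tables 2 and 3.

```python
# Sketch of the rank-based comparison: per-experiment accuracies of several
# feature sets are converted to ranks (1 = best), averaged, and checked with
# the Friedman test. The numbers below are illustrative only.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = repeated CV experiments, columns = feature sets
acc = np.array([[62.6, 60.5, 62.4],
                [62.3, 60.0, 62.6],
                [62.5, 60.6, 62.6],
                [62.5, 59.9, 62.3]])

# rank within each experiment; negate so that the highest accuracy gets rank 1
ranks = np.apply_along_axis(rankdata, 1, -acc)
print("average ranks:", ranks.mean(axis=0))

stat, p = friedmanchisquare(*acc.T)
print("Friedman chi-square = %.3f, p = %.3f" % (stat, p))
```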
4. CONCLUSIONS
In this paper, we presented the data collection and modelling processes aimed at generating a data-based evaluation model for What-if ideas. According to the results of the experiments, the created models manage to reflect human evaluation behaviour, but only to a limited extent. The importance of the features remains an open question, as the results clearly indicate the benefit of using the words that appear in the textual representations of What-ifs, but are not conclusive regarding the importance of the more complex, additionally computed features. With more data, which we expect in the future, we intend to build upon this work and shed more light on the characteristics of feature construction and selection in this challenging problem domain.

5. ACKNOWLEDGMENTS
This research was supported by the Slovene Research Agency and through EC funding for the project WHIM 611560 by FP7, the ICT theme, and the Future and Emerging Technologies (FET) programme.

6. REFERENCES
[1] S. Colton and G. A. Wiggins. Computational creativity: The final frontier? In ECAI, volume 12, pages 21-26, 2012.
[2] J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1-30, 2006.
[3] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169-184. MIT Press, Cambridge, MA, 1999.
[4] A. Jordanous. A standardised procedure for evaluating creative systems: Computational creativity evaluation based on what it is to be creative. Cognitive Computation, 4(3):246-279, 2012.
[5] M. T. Llano, R. Hepworth, S. Colton, J. Gow, J. Charnley, N. Lavrač, M. Žnidaršič, M. Perovšek, M. Granroth-Wilding, and S. Clark. Baseline methods for automated fictional ideation. In Proceedings of the International Conference on Computational Creativity, 2014.
[6] G. Ritchie. Some empirical criteria for attributing creativity to a computer program. Minds and Machines, 17(1):67-99, 2007.
[7] R. Saunders. Towards autonomous creative systems: A computational approach. Cognitive Computation, 4(3):216-225, 2012.
[8] T. Veale. A service-oriented architecture for computational creativity. Journal of Computing Science and Engineering, 7(3):159-167, 2013.
[9] T. Veale. Coming good and breaking bad: Generating transformative character arcs for use in compelling stories. In Proceedings of ICCC-2014, the 5th International Conference on Computational Creativity, Ljubljana, June 2014.