The main goal of this research project is to develop a language independent method for automatic idiom recognition. Idiomatic expressions, such as 'a blessing in disguise' and 'kick the bucket' are plentiful in everyday language, though they remain mysterious, as it is not clear exactly how people learn and understand them. There is no single agreed-upon definition of idiom that covers all members of this class, but idioms tend to be relatively fixed in grammatical form and meaning, but with relatively little predictability in the relation between form and meaning. Also, many idiomatic expressions can appear with both literal, i.e. fully predictable, interpretations given their form -- compare 'The little girl made a face at her mother.' (idiomatic) vs. 'The little girl made a face on the snowman using a carrot and two buttons.' (literal) As a result, idioms present great challenges for a variety of natural language processing applications, including machine translation systems, which often do not detect idiomatic language. To address these challenges, an algorithm is proposed that neither relies on target idiom types, lexicons, or large manually annotated corpora, nor limits the search space by a particular type of linguistic construction. The starting point is that idioms are semantic outliers that violate cohesive structure, especially in local contexts. The following properties are quantified and are incorporated into the outlier detection algorithm: 1) lack of compositionality comparing to literal expressions or other types of collocations; 2) violation of local cohesive ties, so that they tend to be semantically distant from the local topics; 3) while not all semantic outliers are idioms, non-compositional semantic outliers are likely to be idiomatic; 4) idiomaticity is not a binary property; rather, idioms fall on the continuum from being compositional to being partly unanalyzable to completely non-compositional.
This research contributes to the better understanding of idiomatic language, to the computational treatment of such phenomena and, with the creation of high quality, publicly available linguistic resources annotated for idioms, to the facilitation of machine learning research and big data science. Additional benefits include efficient algorithms for computing compositionality and topicality from large corpora, interesting new generalizations about the nature of figurative language, and the training of a cadre of undergraduate and graduate students in highly practical work on a difficult interdisciplinary problem.
|Effective start/end date||8/1/13 → 1/31/18|
- National Science Foundation: $176,514.00