Identifying Technical Terms
Date
2003
Publisher
Te Herenga Waka—Victoria University of Wellington
Abstract
This thesis examines four possible approaches to identifying technical terms:
(1) using the meaning of a word (referred to hereafter as the rating scale approach),
(2) using clues provided in the text (the clue-based approach),
(3) using a technical dictionary (the dictionary-based approach), and
(4) using the range and frequency of word forms (the corpus comparison approach).
As a pilot test, the four approaches were applied to a 5,500-token anatomy corpus in order to decide which approach to identifying terms was the most effective.
In order to identify terms using the meaning of a word, a four-point scale was designed according to the specificity of the meaning of each word to the subject of anatomy. Then an interrater reliability check was carried out to show to what extent the measures were agreed on by different raters. The accuracy score was 0.95.
The rating scale approach was used as the basis for comparing and further evaluating the other approaches to finding technical terms. As a result of this comparison, the clue-based approach and the dictionary-based approach were found to be unsatisfactory, with around 48% or 56% overlap with the items identified by the rating scale approach.
The clue-based approach had limitations due firstly to the nature of the clues and secondly to the fact that the writer selected items (and used clues to signal them) for his or her own purposes, which had little or nothing to do with identifying terms. Furthermore, it was not a practical approach because it required a great deal of extra decision-making.
The dictionary-based approach had weaknesses as a method because the basis for including words in a dictionary is not clear and consistent, and because the main goal of a medical dictionary is not to mark off technical terms but to help people comprehend unknown words in medical discourse. Dictionary makers therefore include many words that are not terms.
The fourth way of identifying terms, the corpus comparison approach, used a ratio based on the range and frequency of word forms in a technical base corpus and a general comparison corpus. The pilot study revealed that the corpus comparison approach is quite effective in identifying technical terms and their common collocates, and is reasonably simple and practical because (1) extra judgment is not required to the same extent as in the clue-based approach, (2) checking the meaning of each word by looking at the context or by consulting a technical dictionary is not required, and (3) sorting items and calculating formulas can be done by computer.
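The abstract does not give the exact ratio used, so the following is only a minimal sketch of the general idea, under stated assumptions: frequency is compared as a normalized rate in each corpus, range is counted as the number of sections of the technical corpus containing a word form, and the function name `candidate_terms`, the thresholds `min_ratio` and `min_range`, and the `smoothing` constant are all hypothetical choices made for illustration, not the thesis's own.

```python
from collections import Counter


def candidate_terms(technical_sections, general_tokens,
                    min_ratio=5.0, min_range=2, smoothing=1.0):
    """Return word forms that are far more frequent in the technical corpus.

    technical_sections: list of token lists, one per section of the
    technical corpus, so that range can be counted per word form.
    general_tokens: flat list of tokens from the general comparison corpus.
    All thresholds are illustrative assumptions.
    """
    tech_counts = Counter()  # frequency of each word form
    tech_range = Counter()   # number of sections containing each word form
    for section in technical_sections:
        tokens = [t.lower() for t in section]
        tech_counts.update(tokens)
        tech_range.update(set(tokens))

    gen_counts = Counter(t.lower() for t in general_tokens)
    tech_total = sum(tech_counts.values())
    gen_total = len(general_tokens)

    candidates = {}
    for word, count in tech_counts.items():
        tech_rate = count / tech_total
        # Smoothing avoids division by zero for words absent from the
        # general corpus (an assumption, not taken from the thesis).
        gen_rate = (gen_counts[word] + smoothing) / (gen_total + smoothing)
        ratio = tech_rate / gen_rate
        if ratio >= min_ratio and tech_range[word] >= min_range:
            candidates[word] = ratio
    return candidates
```

A word form such as "femur" that recurs across sections of an anatomy text but is absent from the general corpus receives a high ratio, whereas a function word such as "the" receives a ratio below 1 and is filtered out.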
To check the advantages described above, the corpus comparison approach (using a ratio based on range and frequency) was applied to a 452,192-token anatomy corpus and a 93,445-token applied linguistics corpus. This approach worked reasonably consistently on the two quite different kinds of technical text, and there was around 85% overlap between the items identified by the corpus comparison approach and those identified by the rating scale approach.
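An overlap figure of this kind can be computed in several ways, and the abstract does not say which was used. The sketch below assumes one plausible reading: the percentage of items found by one approach that were also identified by the reference approach. The function name `overlap_percentage` and the example word lists are hypothetical.

```python
def overlap_percentage(candidate_items, reference_items):
    """Percentage of candidate items that also appear in the reference list.

    The choice of denominator (the candidate list) is an assumption made
    for illustration; the thesis may define overlap differently.
    """
    candidate = set(candidate_items)
    reference = set(reference_items)
    if not candidate:
        return 0.0
    return 100.0 * len(candidate & reference) / len(candidate)


# Hypothetical example: items from an automatic approach compared with
# items judged to be terms on the rating scale.
automatic = ["femur", "tibia", "the", "bone"]
rated = ["femur", "tibia", "bone", "pelvis"]
print(overlap_percentage(automatic, rated))  # 3 of 4 items shared
```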
These results suggest that if the aim is to make a rough estimate of the number of terms and their coverage of a technical text, the corpus comparison approach is satisfactory. However, it is not adequate if the aim is to obtain a complete, definitive list of terms. The most valid and reliable approach (but unfortunately the least practical and most labour intensive) is the rating scale approach.
Keywords
Computational linguistics, Technical English, English language, Word frequency, Terms and phrases, Vocabulary