DRAFT
Article Title: A Machine Learning Approach to Recognizing Acronyms and their Expansions
URL: http://research.microsoft.com/en-us/people/junxu/acronymextraction-icmlc2005.pdf
Mirror URL: https://scottontechnology.com/wp-content/uploads/2016/03/acronymextraction-icmlc2005.pdf
Authors: Jun Xu, Ya-Lou Huang
Keywords: Acronym extraction; expansion; text mining; machine learning
Scott on Technology classifications
Reproducible research: No
Additional Keywords: Support Vector Machines, SVM
Programming language used: Unknown
Number of pages: 6
It is a decent overview paper, but it lacks the detail needed to implement the method. In general, the approach uses rules to identify likely acronyms and candidate expansions, then uses support vector machines (SVM) for “selecting genuine expansions from candidates.”
Sections 1-3 are the introduction, related works, and observations on “recognizing acronyms and expansions from text.”
Sections in detail
4.1 Identify Likely Acronyms
The overall process here is clear. Interestingly, an earlier observation states that “acronyms are generally three to ten characters in length…” yet the method allows likely acronyms of 2 to 10 characters. Also, for Step 3, they don’t indicate which dictionary they used or how they determine whether a token is a person name or a location name. I also found it odd that they check the acronym candidate against an additional stop word list: the stop words should already be in the dictionary, so I don’t know what purpose the extra check serves in this step. However, I have found that ignoring stop words is important in a pattern-based expansion generation step.
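To make the filtering idea concrete, here is a minimal sketch of a rule-based filter for likely acronyms. It is not the paper's implementation: the all-uppercase rule, the tokenizer, and the tiny `DICTIONARY` and `STOP_WORDS` sets are my own assumptions, since the paper does not name its dictionary or stop word list. Only the 2 to 10 character length bound comes from the paper.

```python
import re

# Hypothetical stand-ins: the paper does not say which dictionary
# or stop word list it uses.
DICTIONARY = {"us", "it", "was", "an", "army", "air", "field"}
STOP_WORDS = {"the", "and", "in", "of", "at"}


def likely_acronyms(text, min_len=2, max_len=10):
    """Return tokens that pass a simple rule filter for likely acronyms."""
    candidates = []
    # Assumed tokenizer: runs of letters/digits that start with a letter.
    for token in re.findall(r"\b[A-Za-z][A-Za-z0-9]*\b", text):
        if not (min_len <= len(token) <= max_len):
            continue  # length bound taken from the paper (2 to 10)
        if not token.isupper():
            continue  # assumption: only all-uppercase tokens are considered
        if token.lower() in DICTIONARY or token.lower() in STOP_WORDS:
            continue  # ordinary words and stop words are rejected
        candidates.append(token)
    return candidates


print(likely_acronyms("Orlando AAB and Pinecastle AAF became an AFB in THE US"))
```

Note that "US" is rejected because "us" is in the stand-in dictionary, which shows how much the result depends on the dictionary the paper leaves unspecified.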
4.2 Generate Candidate Expansions
“We observed that expansions always occur in surrounding text where acronyms appear in and always in the same sentence.”  I have found this not to be true.  For example, consider the following text:
During World War II, a number of Army personnel were stationed at the Orlando Army Air Base and nearby Pinecastle Army Air Field. Some of these servicemen stayed in Orlando to settle and raise families. In 1956 the aerospace and defense company Martin Marietta (now Lockheed Martin) established a plant in the city. Orlando AAB and Pinecastle AAF were transferred to the United States Air Force in 1947 when it became a separate service and were re-designated as air force bases (AFB).
“Army Air Field” and “AAF” are not near each other and are not in the same sentence. Measured from just after the “d” in “Field” to the “A” in “AAF”, there are 215 characters between them. A more relaxed statement is that expansions usually appear in the same paragraph as their acronyms.
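The distance and sentence check can be reproduced with a short script. This is a sketch for this one example only: the helper names `char_gap` and `sentence_index` are mine, and the sentence splitter is a naive assumption (sentence-ending punctuation followed by whitespace), not something taken from the paper.

```python
import re

TEXT = (
    "During World War II, a number of Army personnel were stationed at the "
    "Orlando Army Air Base and nearby Pinecastle Army Air Field. "
    "Some of these servicemen stayed in Orlando to settle and raise families. "
    "In 1956 the aerospace and defense company Martin Marietta (now Lockheed "
    "Martin) established a plant in the city. "
    "Orlando AAB and Pinecastle AAF were transferred to the United States Air "
    "Force in 1947 when it became a separate service and were re-designated "
    "as air force bases (AFB)."
)


def char_gap(text, expansion, acronym):
    """Characters between the end of the expansion and the start of the acronym."""
    exp_end = text.index(expansion) + len(expansion)
    acr_start = text.index(acronym, exp_end)  # first occurrence after the expansion
    return acr_start - exp_end


def sentence_index(text, offset):
    """Naive sentence number for a character offset (0-based)."""
    return len(re.findall(r"[.!?]\s", text[:offset]))


gap = char_gap(TEXT, "Pinecastle Army Air Field", "AAF")
exp_sentence = sentence_index(TEXT, TEXT.index("Pinecastle Army Air Field"))
acr_sentence = sentence_index(TEXT, TEXT.index("AAF"))
print(gap, exp_sentence, acr_sentence)  # 215 0 3
```

The expansion sits in the first sentence and the acronym in the fourth, so a generator that only looks inside the acronym's own sentence would miss this pair.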
4.3.2 Features
The feature description (“lower case[sic], numeric, special characters, and white spaces…”) doesn’t specify which features are binary and which are real-valued.
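To show why the distinction matters, here is a small sketch that encodes the same four character classes both ways. The function name, the choice of presence flags for the binary form, and the choice of ratios for the real-valued form are all my assumptions; the paper does not say which encoding it uses.

```python
def char_class_features(expansion):
    """Encode character-class features of a candidate expansion two ways."""
    n = len(expansion) or 1  # guard against an empty string
    counts = {
        "lower": sum(c.islower() for c in expansion),
        "numeric": sum(c.isdigit() for c in expansion),
        "special": sum(not c.isalnum() and not c.isspace() for c in expansion),
        "space": sum(c.isspace() for c in expansion),
    }
    # Binary reading: does the class occur at all?
    binary = {name: int(count > 0) for name, count in counts.items()}
    # Real-valued reading: what fraction of the characters fall in the class?
    ratio = {name: count / n for name, count in counts.items()}
    return binary, ratio


binary, ratio = char_class_features("Army Air Field")
print(binary)
print(ratio)
```

The two encodings give an SVM quite different inputs, so anyone reimplementing the paper has to guess which one was intended.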
References
- Adar, E.: SaRAD: a Simple and Robust Abbreviation Dictionary. HP Laboratories Technical Report, 2004.
- Bowden, P. R.: Automatic Glossary Construction for Technical Papers, Nottingham Trent University, Department Working Paper, December 1999.
- Bowden, P. R., Halstead, P., and Rose, T. G.: Dictionaryless English Plural Noun Singularisation Using a Corpus-Based List of Irregular Forms. In: Ljung, M., ed., Proc. of 17th Int. Conf. on English Language Research on Computerized Corpora, Rodopi, Amsterdam, Netherlands, pp. 130-137.
- Chang, J. T., Schutze, H., and Altman, R. B.: Creating an Online Dictionary of Abbreviations from MEDLINE, J. Am. Med. Inform. Assoc., 9.
- Hettich, S. and Bay, S. D.: The UCI KDD Archive. Irvine, CA: University of California, Department of Information and Computer Science.
- Larkey, L. S., Ogilvie, P., Price, M. A., and Tamilio, B.: Acrophile: An automated acronym extractor and server. In Proc. of 5th ACM Conf. on Digital Libraries. San Antonio, TX: ACM Press, 2000.
- Park, Y., and Byrd, R. J.: Hybrid text mining for finding abbreviations and their definitions. In Proc. of EMNLP, 2001.
- Pustejovsky J., Castano J., Cochran B., Kotecki M., and Morrell M.: Automatic Extraction of Acronym-meaning pairs from MEDLINE databases. In Proc. of Medinfo, 2001.
- Schwartz, A. S., and Hearst, M. A.: A simple algorithm for identifying abbreviation definitions in biomedical text. In Proc. of the Pacific Symposium on Bio-computing, 2003.
- Taghva, K. and Gilbreth, J.: Recognizing Acronyms and their Definitions. Information Science Research Institute, University of Nevada, Technical Report TR 95-03.
- Vapnik, V. N.: The Nature of Statistical Learning Theory. Berlin: Springer-Verlag, 1995.
- Yeates, S.: Automatic Extraction of Acronyms from Text. In Proc. of the Fourth New Zealand Computer Science Research Students’ Conference, 1999
- Yeates, S., Bainbridge, D., and Witten, I.H.: Using compression to identify acronyms in text. In Proc. of Data Compression Conf., IEEE Press, New York, NY, p. 582.
- Yoshida, M., Fukuda, K., and Takagi, T.: PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary, Bioinformatics, 16, pp. 169-175.
- Yu, H., Hripcsak, G., and Friedman, C.: Mapping abbreviations to full forms in biomedical articles. J Am Med Inform Assoc, 2002.
- Acronym Finder: http://www.acronymfinder.com/
- The Canonical Abbreviation/Acronym List: http://www.astro.umd.edu/~marshall/abbrev.html (dead link) Current link: http://marshall.freeshell.org/abbrev.html
Notes/Edits
- Amsterdam is misspelled in the paper’s References; it is corrected here
- The link to The Canonical Abbreviation/Acronym List was dead and a current link is provided
- Some of the information presented here was generated through OCR methods. If you see any errors just drop me a note or add to the comments and we’ll get it corrected.
Takeaways:
- Use of the term “candidate” for potential expansions
- Use of the term “token” (this is common in the SVM/ML domain)
Interesting features to examine:
- Length of acronym
- Length of candidate expansion
- Expansion distance from acronym
Sources:
Proc. – Proceedings