  • Essay / A pattern matching approach to finding IUPAC names in...

    Chemical substances and entities are important terms in chemistry publications and patents. They can be represented in several ways, including IUPAC names, trivial names, SMILES, InChI and CAS registry numbers. Chemical names pose a particular challenge in information retrieval because they are typically long, complex expressions that are subject to variation, which can reduce retrieval performance. The difficulty of obtaining manually annotated data for training NER systems has motivated researchers to look for other ways to generate annotated data or to make better use of unlabeled data. Several systems address the recognition of chemical entities with different approaches. In this paper, we present a pattern matching approach for finding IUPAC names in chemical documents.

    Alexander Vasserman [11] (2004) identified chemical names in biomedical text using approaches based on substring co-occurrence. The models were built on the difference between the character strings that appear in chemical names and those that appear in other words, and were trained on a dictionary of chemical names and on general biomedical text. The work also introduced a new way to interpolate N-grams that requires no parameter tuning. Zornitsa Kozareva [5] (2006) proposed and implemented a model validation search in an unlabeled corpus through which gazetteer lists were automatically generated. The gazetteers were then used as features by a named entity recognition system, and a comparative study of the information the gazetteers provide in the entity classification process was presented. Andreas Vlachos et al. [3] (2006) empirically demonstrated the effectiveness of using tra...... middle of paper ......

    Tim Rocktäschel, Michael Weidlich and Ulf Leser [2] (2012) presented a named entity recognition tool for identifying mentions of chemicals in natural language text, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities.
    They used a hybrid approach combining a conditional random field with a dictionary. The tool achieves an F1 measure of 68.1% on the SCAI corpus, outperforming the chemical NER tool OSCAR 4. A common problem in chemical NER is the scarcity of annotated corpora for training. In this work, we extract chemical terms from research papers in the Indian Journal of Chemistry (Section B) using pattern matching, and we evaluate the extracted entities against ChEBI, a dictionary of molecular entities that follows the nomenclature of the International Union of Pure and Applied Chemistry (IUPAC).
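
    The two steps described above, pattern-based extraction followed by a dictionary check, can be illustrated with a minimal Python sketch. It is not the pattern set used in this work: the regular expression covers only a small, hypothetical subset of IUPAC name parts (a few substituent prefixes, parent stems and suffixes), and TOY_DICTIONARY is an invented stand-in for a list of names taken from ChEBI. The function names extract_candidates and validate are likewise assumptions made for illustration.

```python
import re

# Illustrative pattern only: a small, hypothetical subset of IUPAC name parts.
IUPAC_PATTERN = re.compile(
    r"""
    \b
    (?:\d+(?:,\d+)*-)?                  # optional locants, e.g. "2-" or "1,2-"
    (?:(?:di|tri|tetra)?                # optional multiplier
       (?:methyl|ethyl|propyl|butyl|chloro|bromo|hydroxy|amino|nitro)
       -?(?:\d+(?:,\d+)*-)?)*           # zero or more substituent prefixes
    (?:meth|eth|prop|but|pent|hex|hept|oct|benz)   # parent stem
    (?:an|en|yn)?                       # saturation infix
    (?:-\d+(?:,\d+)*-)?                 # optional locant before the suffix
    (?:e|ol|al|one|oic\ acid|amine)     # suffix
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_candidates(text):
    """Return unique pattern matches, lowercased, in order of first appearance."""
    seen = []
    for match in IUPAC_PATTERN.finditer(text):
        name = match.group(0).lower()
        if name not in seen:
            seen.append(name)
    return seen


def validate(candidates, dictionary):
    """Split candidates into names found in the dictionary and names not found."""
    confirmed = [c for c in candidates if c in dictionary]
    unconfirmed = [c for c in candidates if c not in dictionary]
    return confirmed, unconfirmed


# Hypothetical stand-in for a name list exported from ChEBI.
TOY_DICTIONARY = {"2-methylpropan-1-ol", "ethanoic acid", "methane"}

sample = (
    "The reaction of 2-methylpropan-1-ol with ethanoic acid gave an ester; "
    "2-chlorobutane and methane were also detected."
)

candidates = extract_candidates(sample)
confirmed, unconfirmed = validate(candidates, TOY_DICTIONARY)
precision = len(confirmed) / len(candidates) if candidates else 0.0

print("candidates:", candidates)
print("confirmed:", confirmed)
print("unconfirmed:", unconfirmed)
print("dictionary precision:", precision)
```

    In this toy run, 2-chlorobutane is extracted by the pattern but left unconfirmed, only because it is missing from the toy dictionary. A real pattern set would need far broader coverage of IUPAC nomenclature, and a hand-written pattern like this one will also match some ordinary words, which is one reason a dictionary check is useful.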