Precision/accuracy based mining of PubMed data

May 4, 2009

www.xtractor.in/premium

Precision and accuracy of mining PubMed data

The major concern for most biomedical researchers is the accuracy of data mining. Making computers understand human language, and especially analyze the extracted data, is still a problem for most text mining engines, which are based on Natural Language Processing (NLP). Although NLP engines have a quicker turnaround time, they are not accurate and, most of the time, not comprehensive. We compared XTractor manual curation against one of the best Natural Language Processing engines in the biomedical space. We procured PubMed abstracts from different dates and passed them through the NLP engine, while the XTractor team manually annotated the same set for biomedical relevance.

We found that our manual curators reduced the false positive rate of NLP-picked abstracts by 10-38%. In other words, our manual annotation effort enabled us to:

  1. Pick up additional abstracts that were totally missed by the NLP engine
  2. Correct abstracts that were wrongly annotated for Proteins, Diseases, Drugs and Biological Processes

Some examples of what the NLP engine got wrong include:

Common English term mismatches: MICE, PEG, DAMAGE and RAW, which overlap with protein names

Common isoform mismatches: p16-INK4 matched to p14ARF, and cd11c matched to cd11d

Common protein mismatches: S1P (sphingosine-1-phosphate) matched to sphingosine-1-phosphate receptor, and ERK matched to ephrin type-B receptor 2

Protein-disease mismatches: VHL protein matched to von Hippel-Lindau disease, and progressive multifocal leukoencephalopathy matched to PML protein

Protein-process mismatches: cell growth tagged as growth factor

Protein-drug mismatches: rapamycin tagged as the protein mammalian target of rapamycin
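To illustrate how the common English term mismatches can arise, here is a minimal sketch of a naive dictionary tagger. It is not the engine we tested and not XTractor's method; the tiny lexicon, the function names and the sample sentence are all hypothetical, chosen only to show the failure mode.

```python
import re

# Hypothetical mini-lexicon of protein symbols (illustration only,
# not taken from any real engine or from XTractor).
PROTEIN_SYMBOLS = {"MICE", "PEG", "RAW", "ERK"}

def _tokens(sentence):
    # Split on anything that is not a letter, digit or hyphen.
    return re.findall(r"[A-Za-z0-9-]+", sentence)

def naive_tag(sentence):
    """Case-insensitive lookup: ordinary English words collide with symbols."""
    return [t.upper() for t in _tokens(sentence) if t.upper() in PROTEIN_SYMBOLS]

def case_sensitive_tag(sentence):
    """Only accept tokens written exactly like the symbol (all caps here)."""
    return [t for t in _tokens(sentence) if t in PROTEIN_SYMBOLS]

if __name__ == "__main__":
    sentence = "Raw data from mice treated with ERK inhibitors"
    print(naive_tag(sentence))           # ['RAW', 'MICE', 'ERK']
    print(case_sensitive_tag(sentence))  # ['ERK']
```

Case-sensitive matching removes the two spurious hits in this sentence, but it is only a partial remedy: it cannot separate an isoform from its neighbour or a protein from a disease with the same name, and those cases need a human reading the context.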

Together, these two aspects amounted to 10-38% false positives.
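A comparison of this kind can be sketched as simple set arithmetic over abstract identifiers. The snippet below is an assumption about how such a tally might be done, not a record of our actual analysis; the function name, the dictionary keys and the identifiers "A1" to "A7" are made up for illustration. Here "false positive share" means the fraction of NLP-picked abstracts that the manual curators did not confirm.

```python
def compare_annotations(nlp_picked, manually_confirmed):
    """Compare NLP-picked abstracts with the manually confirmed set."""
    nlp = set(nlp_picked)
    manual = set(manually_confirmed)

    true_pos = nlp & manual    # picked by NLP and confirmed manually
    false_pos = nlp - manual   # picked by NLP, rejected by curators
    missed = manual - nlp      # relevant abstracts the NLP never picked

    return {
        "false_positives": sorted(false_pos),
        "missed": sorted(missed),
        # Share of NLP picks that were wrong (1 - precision).
        "false_positive_share": len(false_pos) / len(nlp) if nlp else 0.0,
        # Share of relevant abstracts that the NLP did find.
        "recall": len(true_pos) / len(manual) if manual else 0.0,
    }

if __name__ == "__main__":
    # Made-up identifiers, not real PubMed IDs.
    nlp_set = ["A1", "A2", "A3", "A4", "A5"]
    manual_set = ["A1", "A2", "A3", "A4", "A6", "A7"]
    report = compare_annotations(nlp_set, manual_set)
    print(report["false_positives"])  # ['A5']
    print(report["missed"])           # ['A6', 'A7']
```

In this toy example one of five NLP picks is wrong, a false positive share of 20%, and two relevant abstracts are missed altogether.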

Solution:
XTractor manual annotation begins with screening all the relevant records from PubMed every day. This is followed by manual annotation and sentence categorization. Finally, the categorized sentences are manually quality checked again for accuracy, and automated validations are run on them to keep false positives out of the knowledgebase.

Each fact presented in the XTractor knowledgebase therefore passes through 2 rounds of quality checks before it is presented to the user, so you can rest assured that the data provided will be 98-99% accurate.

ROI:

Erroneous data from text mining engines can lead to wrong assumptions and hypothesis building, which may ultimately lead to major losses in your drug discovery program, amounting to millions of dollars. So try XTractor, the world's first knowledgebase of manually curated PubMed data, updated every day. FREE trial at http://www.xtractor.in/premium/trial.do

