ApacheCon @Home - cTAKES Track

Apache cTAKES Track

Tuesday 16:15 UTC
Apache cTAKES: First Principles and Customization
Sean Finan

Built using Apache UIMA, the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is a modular and extensible tool for Natural Language Processing. This is a quick-start tutorial on adding custom elements to cTAKES. We illustrate creating simple classes to input, process, and output data. This involves a concise overview of Apache uimaFIT and the cTAKES type system, as well as building a UIMA pipeline using piper files.

Sean Finan is a software developer in the Natural Language Processing lab at Boston Children's Hospital. He has worked with Apache cTAKES for the past 8 years, contributing code and supporting the community.

Tuesday 16:55 UTC
Integrating UIMA Components into cTAKES
Siamak Barzegar

Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) is an open-source Natural Language Processing system for extracting information from Electronic Health Records (EHR). cTAKES consists of a number of components that work only with English documents. We integrated two important tools, HeidelTime and FreeLing, into cTAKES; they provide language analysis functionality (Temponym Tagging, Morphological Analysis, Named Entity Detection, PoS-Tagging, Parsing, Word Sense Disambiguation, Semantic Role Labelling, and so forth) for a variety of languages. We also adapted HeidelTime’s grammar and FreeLing to the medical domain in Spanish. Because the cTAKES components, HeidelTime, and FreeLing use different type systems, we faced interoperability challenges, which we solved by adapting the native type system of cTAKES for the HeidelTime and FreeLing wrappers.

Siamak Barzegar is a Senior Research Engineer in the Biomedical Text Mining Unit at the Barcelona Supercomputing Center in Spain. He won the Science Foundation Ireland (SFI) research scholarship and received his PhD degree from the National University of Ireland, Galway in December 2018. His work and research focus on Natural Language Processing, Distributional Semantics, Word Embeddings, Deep Learning, and Knowledge Extraction in Multilingual and Specific Domains.

Tuesday 17:35 UTC
Secret Engines of Apache cTAKES
Sean Finan

The Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) default pipeline is a standard in the clinical natural language processing research community. What lies beyond that standard? While the default clinical pipeline uses almost 20 analysis engines, there are dozens more in various cTAKES modules. We present and discuss the top 5 annotation engines you never knew you had.

Tuesday 18:15 UTC
Advanced Dictionary use in Apache cTAKES
Sean Finan, Jeff Miller

Named Entity Recognition is at the core of all complete natural language processing tools. Out of the box, the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) uses a dictionary containing the part of the Unified Medical Language System (UMLS) that covers most common clinical terms. It also comes with a custom dictionary creator. If you think that your clinical research is directed, then you should probably have a directed dictionary. UMLS subsets, non-English dictionaries, and novel custom dictionaries have all been used successfully with cTAKES. This is an overview of cTAKES named entity recognition, with the essential what, why, and how of custom dictionaries as the centerpiece. We will also discuss configuration for discontiguous spans and for subsumption of short terms.

Sean Finan:
Sean Finan is a software developer in the Natural Language Processing lab at Boston Children's Hospital. He has worked with Apache cTAKES for the past 8 years, contributing code and supporting the community.
Jeff Miller:
Jeff Miller leads a team of data scientists at the Children's Hospital of Philadelphia (CHOP). His work focuses on developing tools to help researchers analyze clinical data. Jeff holds a master's degree in applied statistics from Penn State University.

Wednesday 16:15 UTC
REST Support for Apache cTAKES
Gandhirajan N, Sean Finan

Apache cTAKES™ is a natural language processing system for the extraction of information from electronic medical record clinical free-text. It is predominantly a desktop-based application. This session covers enabling REST support in cTAKES. We will set up UMLS knowledge sources in a MySQL database using scripts generated by the cTAKES Dictionary Creator GUI, which in turn uses the MetamorphoSys UMLS installation wizard. We will deploy the cTAKES web REST module in Tomcat; the application will use the cTAKES engine to analyze the payload passed via a REST call against the MySQL database source and return the analysis findings as JSON. We will also give a quick demo of the steps mentioned above. This will help the healthcare industry perform NLP analysis using the cTAKES engine with just a REST endpoint.

Gandhirajan N:
Software developer with 15 years of experience in product design and development. Currently working on developing cloud-native applications using Spring Boot and deploying the same in Azure. Apache committer in cTAKES and Cordova projects.
Sean Finan:
Sean Finan is a Software Developer in the CHIP-NLP group, contributing his experience to their ongoing projects that utilize and help expand the capabilities of Natural Language Processing. Originally a Geophysicist and Materials Scientist, Sean gained his interest in software development while creating computer simulations as analogues of physical processes studied in his laboratory research. After leaving academia and a year of employment at the Mayo Clinic, Sean moved to Houston to work eleven years with Landmark Graphics, the leading provider of scientific software for the energy industry. He is a PMC member and committer on the Apache cTAKES project.

Wednesday 16:55 UTC
SpaCTeS: Extraction of Information on Diagnosis of Stroke from Electronic Health Reports
Siamak Barzegar

Most of the relevant data produced in stroke clinical settings consists of unstructured data (clinical narrative texts in Electronic Health Records (EHR)). We tested new text mining (TM) techniques to assist in extracting relevant information from hospital discharge reports of patients diagnosed with a stroke (2016 to 2017). We developed a TM pipeline structured into iterative phases to gradually improve the quality of transforming narrative discharge reports into structured clinical data representations and generating good practice recommendations. The initial system was developed using Apache cTAKES, a natural language processing system for information extraction from EHRs, initially developed by the Mayo Clinic. The main challenge was the heterogeneity of the source data (3000 documents in Spanish and Catalan from 28 different hospitals). We developed an analysis tool to test the quality of texts by identifying missing information and non-standard usage of notations and vocabularies. The system also produced a normalized version of the texts. These results allowed a detailed analysis of the stroke narrative records and the identification of aspects such as heterogeneity (and its problems) and degree of standardization, all of which are critical to enabling better exploitation of the information contained in EHRs by TM approaches.

Wednesday 17:35 UTC
Customize cTAKES for Automated Adverse Drug Event Surveillance in Pediatric Pulmonary Hypertension
Chen Lin

Based on the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES), an open-source NLP system, we built a customized pipeline and processed 149,038 notes for 984 pediatric Pulmonary Hypertension (PH) patients to detect textual mentions and signs/symptoms that may represent adverse drug events (ADEs). Our pipeline featured a customized dictionary for term mentions of interest and emphasized term negation, temporality of events, and proximity among mentions to refine the detection of co-occurring medications and potential drug effects. Analysis showed that our automatic ADE detection system identified up to 7-fold higher ADE rates than those ascertained from diagnostic codes.

Chen is an Applications Development Specialist in the Children’s Hospital Informatics Program-Natural Language Processing (CHIP-NLP) group. Chen is actively incorporating statistical and machine learning technologies into advanced NLP tasks being investigated at CHIP-NLP. Topics include automatic feature selection, coreference resolution, and disease activity classification based on clinical narratives. Chen has worked on several projects, including the development of a novel complementary mining process that made use of features unused by a priori defined phenotypes, and he authored interactive phenotype-mining and visualizing software. In addition, Chen has research experience in deriving human cancer gene interaction networks based on genome-wide survival analysis.

Thursday 16:15 UTC
Extracting Patient Narrative from Clinical Notes: Implementing Apache cTAKES at Scale Using Apache Spark
Debdipto Misra

Patient notes not only document patient history and clinical conditions but are also rich in contextual data and are usually more reliable sources of medical information than discrete values in the Electronic Health Record (EHR). For a medium-sized integrated health system like Geisinger, this amounts to approximately fifty thousand notes each day. For information extraction on retrospective data, the volume can run into millions of notes depending on the selection criteria. This talk describes the journey taken by the Data Science Team at Geisinger to implement a distributed pipeline that uses Apache cTAKES as the Natural Language Processing (NLP) engine to annotate notes across the entire spectrum of patient care. From rewriting certain components in the cTAKES engine to architecting the data store and optimizing the pipeline for better throughput, this talk delves into the technical difficulties faced while aspiring to truly do NLP at scale on clinical notes. Towards the end, the talk also demonstrates a few use cases and shows how cTAKES has helped clinicians and stakeholders extract patient narratives from patient notes using Apache Solr and Banana.

Debdipto Misra is a Data Scientist with Geisinger Health. Previously, he worked with AOL Inc. as a Platform Engineer in Audience Analytics and with EMC Corp. as a Systems Engineer. He has worked in the Data Mining and Analytics space for over half a decade. He won a fellowship and presented “Evolution of Prosthetics using Pattern Recognition on Ultrasound Signals” at the 2014 IEEE Big Data Conference in Washington, DC. He has also published in multiple journals and presented at healthcare conferences such as HIMSS. Currently, his main focus is on building capacity planning tools for healthcare organizations for bed supply and demand using various deep learning approaches and integrating them with patient notes.

Thursday 16:55 UTC
Fault-Tolerant, Distributed, and Scalable Natural Language Processing with cTAKES
Jeritt Thayer, Jeffrey Miller

Electronic health records contain a substantial amount of clinical information as unstructured free text. This information has the potential to enhance clinical decision making as well as provide insight for secondary health-related research. The Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is a health-specific natural language processing (NLP) system that has demonstrated success in the health care industry. However, analyzing large sets of notes with cTAKES can take months or even years to complete. By combining cTAKES with Apache Spark, we developed a fault-tolerant and scalable NLP pipeline that respects the single-threaded limitation inherent in cTAKES pipelines. It is capable of processing millions of clinical notes in minutes on a large computing cluster. We have also configured the pipeline to make it easy to adjust common settings, such as changing negation detection algorithms and toggling whether or not to detect entities over discontinuous spans. At the completion of this session, you will have a practical example of processing large volumes of unstructured text using cTAKES and be able to identify the benefits of using different Apache distributed computing frameworks such as Spark and Beam.

Jeritt Thayer:
Jeritt Thayer is a software engineer at Children's Hospital of Philadelphia. His work focuses on designing, developing, and evaluating novel systems to support patient engagement, medical decision making, and care delivery. Prior to his career in software, Jeritt was a professional soccer player. Jeritt is passionate about developing applications that support asynchronous and non-colocated communication to improve provider coordination and patient outcomes.
Jeff Miller:
Jeff Miller leads a team of data scientists at the Children's Hospital of Philadelphia (CHOP). His work focuses on developing tools to help researchers analyze clinical data. Jeff holds a master's degree in applied statistics from Penn State University.

Thursday 17:35 UTC
Apache cTAKES and Python; Apache cTAKES High Throughput Orchestration
Dmitriy Dligach, Sean Finan, Peter Abramowitsch

1. The rise of Natural Language Processing Machine Learning libraries in Python has created opportunities for the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES). There are also challenges in utilizing the Java-based cTAKES type system across platforms. 2. We have built a high-throughput orchestration mechanism to process and publish millions of redacted and unredacted notes in a PHI-safe environment and to manage refreshes in which notes can continually be re-redacted or made obsolete. We have a high-urgency stream of COVID-related notes that is refreshed weekly.

Dmitriy Dligach:
The overarching goal of Dr. Dligach's research is developing methods for automatic semantic analysis of texts. His work spans such areas of computer science as natural language processing, machine learning, and data mining. Most recently his research has focused on semantic analysis of clinical texts. He works both on method development and applications.
Sean Finan:
Sean Finan is a software developer in the Natural Language Processing lab at Boston Children's Hospital. He has worked with Apache cTAKES for the past 8 years, contributing code and supporting the community.
Peter Abramowitsch:
Peter Abramowitsch started using cTAKES while working in the Hearst Health Innovation Lab. He is now an Architect and cTAKES Implementer at the Bakar Computational Health Sciences Institute at the University of California, San Francisco.
