The primary goal of this set of modules is to take text processing capabilities up one level of abstraction. Many of the items listed below already exist as freely available libraries. These libraries are fairly low-level, though, and using them successfully often requires dedicated effort and domain knowledge. Word embeddings make it easy to find synonyms, for instance, and are often compared on how good they are at that task. However, synonym-finding has no clear connection to the NLP problems users typically have, such as text summarization, knowledge identification and extraction, or matching. Such problems sit at a higher level of abstraction, and work is needed to determine how to use what word embeddings provide to solve them.
What we aim to provide with the components below is a set of building blocks that can be used to solve problems at this higher level of abstraction. Special attention has been paid to keeping latency low and throughput high in the core libraries, which enables real-time experimentation and exploration, and supports quick machine learning training and validation cycles.
Except for the items explicitly labeled experimental, all code listed here has been extensively tested on real-world data at real-world scale over the course of three years with several large corporate customers.
rosetta were all used in production systems that handled O(100 million) document corpora.
ginomai is production-ready, but was never launched in a product.
Regarding performance, most of these components were tuned for low-latency, high-throughput operation. In a production system with O(100 million) text items, matching user-generated text against all items took under 3 seconds round trip on a single rack-grade server. The pipeline preprocessed the user text to expand acronyms and jargon, converted the result to a vector, performed nearest-neighbor search over the pre-indexed O(100 million) text items, performed metadata lookup, and finally collated and returned the results.
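The pipeline stages above can be sketched end to end in a few lines. This is a toy illustration in Python, not the library's Scala API: the names `expand`, `vectorize`, `match`, the tiny corpus, and the hashed bag-of-words vector are all assumptions made for the example, and the brute-force scan stands in for the indexed nearest-neighbor search.

```python
import math
import zlib

# Hypothetical toy data; the production system indexed O(100 million) items.
ACRONYMS = {"nn": "nearest neighbor", "lsh": "locality sensitive hashing"}
CORPUS = {
    "doc1": "locality sensitive hashing for nearest neighbor search",
    "doc2": "citation graphs and pagerank for academic articles",
}
METADATA = {"doc1": {"title": "LSH overview"}, "doc2": {"title": "Citation metrics"}}
DIM = 256  # width of the toy vector representation


def expand(text):
    """Step 1: replace known acronyms with their expansions."""
    return " ".join(ACRONYMS.get(tok, tok) for tok in text.lower().split())


def vectorize(text):
    """Step 2: hashed bag-of-words vector (a stand-in for a real embedding)."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Corpus vectors are computed once, ahead of query time ("pre-indexed").
INDEX = {doc_id: vectorize(text) for doc_id, text in CORPUS.items()}


def match(user_text, k=1):
    """Steps 3-5: nearest-neighbor search, metadata lookup, collation."""
    query = vectorize(expand(user_text))
    ranked = sorted(INDEX, key=lambda d: cosine(query, INDEX[d]), reverse=True)[:k]
    return [{"id": d, "score": cosine(query, INDEX[d]), **METADATA[d]} for d in ranked]


print(match("LSH for NN"))
```

Expanding the acronyms before vectorizing is what lets the short query "LSH for NN" share vocabulary with the indexed text at all.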
Regarding technology, the vast majority of the code is written in Scala 2.12 and is suitable for use in cloud environments. The deployed applications described below used Dockerized services wrapped around the described components, deployed and orchestrated with Kubernetes.
Components and capabilities
Core text processing library. Its main feature is high-throughput, low-latency word embedding and matching using optimized Explicit Semantic Analysis and Locality-Sensitive Hashing techniques ("scale up"). A less-optimized module for GloVe embeddings is also included, and straightforward interfaces are provided for adding other word embeddings. These methods can be used singly or in combination to index text corpora for later matching, analysis, etc. The library can also represent texts of arbitrary size as fixed-width vectors or bitstrings, which enables machine learning applications. Finally, it has experimental techniques for manipulating the representation with boolean operators to support boolean querying, and for adapting the representation using feedback about match quality, so that applications can adapt to user feedback.
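To make the idea of a fixed-width bitstring concrete, here is a minimal Python sketch of one common Locality-Sensitive Hashing family, random-hyperplane signatures. It is an assumption that this resembles what the library does; the source does not say which LSH scheme it uses, and the names `make_hyperplanes`, `signature`, and `hamming` are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Encode vec as an integer whose bits record which side of each plane it lies on."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits; small distance suggests a small angle between vectors."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=4, n_bits=16)
v = [0.5, -1.0, 2.0, 0.25]
same_direction = [2 * x for x in v]
opposite = [-x for x in v]

print(hamming(signature(v, planes), signature(same_direction, planes)))
print(hamming(signature(v, planes), signature(opposite, planes)))
```

The signature depends only on the direction of the vector, so rescaling a vector leaves it unchanged while negating the vector flips every bit. Because signatures are plain integers, bitwise operators such as `&` and `|` apply to them directly.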
Compute cluster for both indexing and matching with Topsy ("scale out"). It allows high-throughput, low-latency matching over large numbers (> O(100 million)) of text fragments on commodity hardware, and can be scaled out arbitrarily to decrease latency, increase throughput, or increase corpus size. It is based on a simple-to-deploy in-memory data grid.
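One way to picture scale-out matching is a scatter-gather search: each node scores only its own partition of the corpus and returns its local best matches, and a coordinator merges them. The Python sketch below is an illustration under that assumption, not a description of how the cluster or its data grid is actually implemented; `local_top_k` and `scatter_gather` are hypothetical names.

```python
import heapq


def local_top_k(shard, score, k):
    """Each node scores only its own shard and returns its k best (score, id) pairs."""
    return heapq.nlargest(k, ((score(item), item_id) for item_id, item in shard.items()))


def scatter_gather(shards, score, k):
    """Merge the per-shard results into a global top k."""
    # On a real cluster the per-shard calls would run in parallel on separate nodes.
    partials = [local_top_k(shard, score, k) for shard in shards]
    return heapq.nlargest(k, (pair for part in partials for pair in part))


# Toy shards whose "items" are already similarity scores.
shards = [{"a": 1, "b": 5}, {"c": 3, "d": 9}]
print(scatter_gather(shards, score=lambda item: item, k=2))  # [(9, 'd'), (5, 'b')]
```

Because every shard returns at most k candidates, the merge step stays cheap no matter how large the corpus grows, which is why adding nodes can raise capacity without raising merge cost.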
Preprocessor for acronyms and multi-word jargon phrases. It is tuned to the life sciences using custom acronym lists as well as acronym and jargon lists developed from UMLS (Unified Medical Language System). It can find and mark up acronyms and multi-word jargon phrases in a text, and look up likely expansions for acronyms and likely definitions for multi-word phrases, given context. The emphasis is on high-throughput, low-latency processing. A key capability of this module is transforming acronym and jargon text into other text that is more likely to yield a meaningful vector encoding in Topsy.
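Context-dependent acronym expansion can be illustrated with a simple overlap heuristic: pick the candidate expansion whose cue words appear most often in the surrounding text. The Python sketch below is an assumption about how such a lookup might work, not the module's actual method; the `EXPANSIONS` table and its cue words are hypothetical and are not taken from UMLS.

```python
import re

# Hypothetical expansion table: acronym -> {expansion: context cue words}.
EXPANSIONS = {
    "MS": {
        "multiple sclerosis": {"patient", "lesion", "neurology", "relapse"},
        "mass spectrometry": {"ion", "peptide", "spectrum", "chromatography"},
    },
}


def tokenize(text):
    """Split text into alphabetic tokens (punctuation is dropped in this sketch)."""
    return re.findall(r"[A-Za-z]+", text)


def expand_acronyms(text):
    """Replace each known acronym with the expansion best supported by the context."""
    tokens = tokenize(text)
    context = {t.lower() for t in tokens}
    out = []
    for tok in tokens:
        candidates = EXPANSIONS.get(tok)
        if candidates:
            # Score each expansion by how many of its cue words occur in the text.
            # Ties go to the first candidate listed.
            best = max(candidates, key=lambda exp: len(candidates[exp] & context))
            out.append(best)
        else:
            out.append(tok)
    return " ".join(out)


print(expand_acronyms("peptide analysis by MS"))
print(expand_acronyms("MS patient with new lesion"))
```

The expanded text contains ordinary words that an embedding can encode, whereas the bare acronym would be ambiguous or missing from the vocabulary.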
The primary feature is a model trained to predict whether an article will become highly cited within seven years of publication, given only facts available in the first year after publication. It favors recall over precision: on validation data it achieved 90% recall at 33% precision, which entails an almost 10-fold decrease in how many articles need to be scanned to have a 90% chance of encountering future well-cited articles. It also has code for representing and populating the citation graph of a collection of articles, which enables the computation of PageRank and similar metrics. There is some experimental code for computing features over that citation graph that might be used to identify emerging trends, namely new subdisciplines that are beginning to emerge in the publication record. There is also some speculative modeling of citation trajectories for articles, which could potentially lead to superior (i.e., more predictive) impact factor metrics, among other uses.
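The link between the recall and precision figures and the reduction in reading effort can be worked through explicitly. The reduction also depends on the base rate of highly cited articles, which the text does not state; the 4% used below is purely a hypothetical value chosen for illustration, and the function names are likewise assumptions.

```python
def flagged_fraction(recall, precision, base_rate):
    """Fraction of all articles that the model flags as likely to be highly cited."""
    # True positives per article = recall * base_rate.
    # Precision = true positives / flagged, so flagged = true positives / precision.
    return recall * base_rate / precision


def scan_reduction(recall, precision, base_rate):
    """How many times fewer articles must be scanned if only flagged ones are read."""
    return 1.0 / flagged_fraction(recall, precision, base_rate)


# 90% recall and 33% precision are from the text; the 4% base rate is hypothetical.
print(round(flagged_fraction(0.90, 0.33, 0.04), 3))  # 0.109
print(round(scan_reduction(0.90, 0.33, 0.04), 1))    # 9.2
```

Under that assumed base rate, reading only the flagged articles covers 90% of the future highly cited ones while scanning roughly one ninth of the corpus. A rarer positive class would make the reduction larger, and a more common one would make it smaller.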
Deployed applications using the above components
- Legit: The primary tool produced by Legit was an "automated research assistant". Users first create a workspace for a project and supply a free-text description of what they are working on. The backend immediately matches this text against patents and academic articles deemed to be similar and allows the user to explore these. The user can thumbs up / thumbs down articles and save them to a folder in their workspace, where they can add comments and other annotations or share with other users. The backend continuously monitors all datasets and gives the users newly-discovered matches via a notification mechanism (email or in-app), which the user can then explore, save, annotate, etc. All data generated by user activity is used to drive the Topic Trends Dashboard.
- Expert Finder: A user enters a free-text description of their project or problem and is presented with a list of possible "experts" in that subject matter. Expertise is determined by first matching the free text against patents and articles, then retrieving the authors of these items, and finally ranking the authors based on authorship position, inventor status, and proprietary criteria.
- Topic Trends Dashboard: The backend system kept track of what members of an organization were writing in project descriptions and what literature and patents they were exploring. This dashboard provided insights into topics that appeared repeatedly in user write-ups and in explored literature. Dashboard users could view plots of how common each topic was through time and drill down into which users were writing about which topics. Individual topics could be explored, with "experts" recommended and "competitors" highlighted based on authorship, inventorship, and assigneeship (the latter two in the case of patents).
- Adapting Google Patent Results: A web browser plugin that allowed users to search for patents using Google Patents, but then re-ranked and augmented the Google Patents search results with additional information supplied by the backend.
- Explore: A simple tool allowing users to explore the patent and PubMed literature corpora via free text. A user enters a description of something they are interested in and is shown patents and articles considered to be conceptually similar. They can choose one or more of these patents or articles and regenerate the results, repeating as long as desired. Along the way, users can save patents and articles they find interesting, and can branch out their exploration from any of the saved items.
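The author-ranking step of the Expert Finder can be sketched as a weighted sum of match scores per author. The Python below is only an illustration: the position weights are hypothetical, inventor status is left out for brevity, and the proprietary criteria mentioned above are not represented at all.

```python
from collections import defaultdict


def position_weight(position, n_authors):
    """Hypothetical weights: first author counts most, last author next, others least."""
    if position == 0:
        return 1.0
    if position == n_authors - 1:
        return 0.8
    return 0.5


def rank_experts(matches):
    """matches: list of (match_score, [author names in byline order])."""
    totals = defaultdict(float)
    for score, authors in matches:
        for i, name in enumerate(authors):
            totals[name] += score * position_weight(i, len(authors))
    # Highest total first; ties are broken alphabetically for a stable order.
    return sorted(totals, key=lambda name: (-totals[name], name))


matches = [
    (0.9, ["Ada", "Bo", "Cy"]),  # strong match, three authors
    (0.5, ["Bo", "Ada"]),        # weaker match, two authors
]
print(rank_experts(matches))  # ['Ada', 'Bo', 'Cy']
```

Weighting by match score means an author of several closely matching items outranks an author of one loosely matching item, regardless of total publication count.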
Potential application areas
- Conceptual Information Retrieval: find text items that match user-supplied text on a conceptual level, as opposed to a word or synonym level
- Active Literature View: maintain a set of text fragments that are actively matched against newly-appearing technical texts, with alerting
- Idea Landscaping: situate an idea, as expressed in text, in a space of other ideas represented by text fragments, to determine if it lies in a crowded, empty, or mixed part of the space and to explore what ideas are nearby
- Computer Aided Authoring: see examples of similar technical text in real time while authoring. See suggested completions of acronyms and jargon phrases along with definitions. An IDE for domain-specific technical writing.
- Emerging Trends Detection: identify articles that are likely to accumulate a lot of citations. Match these with similar articles that are also likely to receive a lot of citations. Summarize the set with key phrases.
- Adaptation to Idiolect: adapt the output of matching to a person or organization's specific word usage patterns using feedback about match quality provided by users as they interact with a search or exploration tool.