The Future of Search Engines

Sandlin, Anu | Aug 31, 2018
Search engines have changed the world. They put vast amounts of information at our fingertips. But search engines have their flaws, says iSchool Associate Professor Matthew Lease. Search results are often not as “smart” as we’d like them to be, lacking a true understanding of language and human logic. They can also replicate and deepen the biases embedded in our searches, rather than bringing us new information or insight.
Dr. Lease believes there may be better ways to harness the dual power of computers and human minds to create more intelligent information retrieval (IR) systems, benefiting general search engines as well as niche ones like those used for medical knowledge or non-English texts. At the 2017 Annual Meeting of the Association for Computational Linguistics in Vancouver, Canada, Dr. Lease and his collaborators from The University of Texas at Austin and Northeastern University presented two papers describing their novel information retrieval systems, drawing on research that leverages the supercomputing resources at UT Austin's Texas Advanced Computing Center.
In one paper, they presented a method that combines input from multiple annotators—humans who hand-label data used to train and evaluate intelligent algorithms—to determine the best overall annotation for a given text. They applied this method to two problems. First, they analyzed free-text research articles describing medical studies to extract details of each study, such as patient condition, demographics, treatments, and outcomes. They also used named-entity recognition to analyze breaking news stories to identify the events, people, and places involved.
“An important challenge in natural language processing is accurately finding important information contained in free-text, which lets us extract it into databases and combine it with other data to make more intelligent decisions and new discoveries,” Dr. Lease said. “We’ve been using crowdsourcing to annotate medical and news articles at scale so that our intelligent systems will be able to more accurately find the key information contained in each article.”
Such annotation has traditionally been performed by in-house domain experts. However, crowdsourcing has recently become a popular means to acquire large, labeled datasets at lower cost. Predictably, annotations from laypeople are of lower quality than those from domain experts, so it is necessary to estimate the reliability of crowd annotators and to aggregate individual annotations into a single set of "reference standard" consensus labels.
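The general idea of jointly estimating annotator reliability and consensus labels can be sketched as follows. This is a hypothetical illustration, not the paper's actual model: it alternates between weighted voting over labels and re-estimating each worker's reliability as their agreement with the current consensus, in the spirit of classic label-aggregation methods such as Dawid and Skene's.

```python
# Hypothetical sketch: iteratively estimate worker reliability and
# "reference standard" consensus labels from crowd annotations.
# Not the authors' published method; an illustration of the general idea.
from collections import defaultdict

def aggregate(annotations, n_iters=10):
    """annotations: list of (worker_id, item_id, label) triples."""
    # Start by trusting every worker equally.
    reliability = defaultdict(lambda: 1.0)
    consensus = {}
    for _ in range(n_iters):
        # Weighted vote per item: each worker's vote counts in
        # proportion to their estimated reliability.
        votes = defaultdict(lambda: defaultdict(float))
        for worker, item, label in annotations:
            votes[item][label] += reliability[worker]
        consensus = {item: max(labels, key=labels.get)
                     for item, labels in votes.items()}
        # Re-estimate reliability as each worker's agreement rate
        # with the current consensus labels.
        agree, total = defaultdict(int), defaultdict(int)
        for worker, item, label in annotations:
            total[worker] += 1
            agree[worker] += (label == consensus[item])
        for worker in total:
            reliability[worker] = agree[worker] / total[worker]
    return consensus, dict(reliability)
```

A worker who frequently disagrees with the consensus is down-weighted on the next pass, which in turn sharpens the consensus—the same feedback loop that makes reliability estimates useful for error analysis and for routing tasks to the best annotators.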
Lease’s team found that their method was able to train a neural network—a form of artificial intelligence (AI) modeled on the human brain—so it could very accurately predict named entities and extract relevant information in unannotated texts. The new method improves upon existing tagging and training methods. It also provides an estimate of each worker’s label quality, which can be transferred between tasks and is useful for error analysis and intelligently routing tasks—identifying the best person to annotate each particular text.
The group’s second paper addressed the fact that neural models for natural language processing (NLP) often ignore existing resources like WordNet—a lexical database for the English language that groups words into sets of synonyms—or domain-specific ontologies, such as the Unified Medical Language System, which encode knowledge about a given field.
They proposed a method for exploiting these existing linguistic resources via weight sharing to improve NLP models for automatic text classification. For example, their model learns to classify whether or not published medical articles describing clinical trials are relevant to a well-specified clinical question. In weight sharing, similar words share some fraction of a weight, or assigned numerical value. Weight sharing constrains the number of free parameters that a system must learn, thereby increasing the efficiency and accuracy of the neural model, and serving as a flexible way to incorporate prior knowledge. In doing so, they combine the best of human knowledge with machine learning.
“Neural network models have tons of parameters and need lots of data to fit them,” said Lease. “We had this idea that if you could somehow reason about some words being related to other words a priori, then instead of having to have a parameter for each one of those words separately, you could tie together the parameters across multiple words and in that way, need less data to learn the model. It would realize the benefits of deep learning without large data constraints.”
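One simple way to realize this kind of parameter tying can be sketched in a few lines. This is an assumed illustration, not the authors' implementation: each word's embedding is the sum of a small private vector and a vector shared by its synonym group (e.g., a WordNet synset), so related words share most of their parameters and fewer examples are needed to fit them.

```python
# Hypothetical sketch of weight sharing across related words.
# Each embedding = shared group vector + small private vector,
# so synonyms are tied together while keeping word-specific nuance.
import numpy as np

rng = np.random.default_rng(0)

# Toy synonym groups, stand-ins for WordNet synsets or UMLS concepts.
groups = {"anemia": "g_anemia", "anaemia": "g_anemia",
          "film": "g_movie", "movie": "g_movie", "picture": "g_movie"}

dim = 8
shared = {g: rng.normal(size=dim) for g in set(groups.values())}      # tied
private = {w: rng.normal(scale=0.1, size=dim) for w in groups}        # free

def embed(word):
    # The shared component ties parameters across the whole group;
    # the private component lets each word keep its own nuance.
    return shared[groups[word]] + private[word]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Because the shared vector dominates, "film" and "movie" start out close together before any training data is seen—prior knowledge from the linguistic resource substitutes for data the model would otherwise need.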
They applied a form of weight sharing to a sentiment analysis of movie reviews and to a biomedical search related to anemia. Their approach consistently yielded improved performance on classification tasks compared to strategies that did not exploit weight sharing. By improving core natural language processing technologies for automatic information extraction and classification of texts, Dr. Lease says web search engines built on these technologies can continue to improve.
Two new faculty members join iSchool

Ferguson, John | Aug 29, 2016
The School of Information has hired two new faculty members whose research is already shaping the interdisciplinary field of information studies.
Assistant Professor Amelia Acker researches the data that people create when they use mobile phones to send text messages, update their Facebook status, or leverage wireless networks in myriad other ways, such as automatically generating GPS coordinates.
“Amelia is emerging as one of the brightest young scholars in digital records and data traces, helping us better understand the transmission of information through time and media,” iSchool Dean and Professor Andrew Dillon said. “She will significantly advance our traditional strengths in archives and records management while enabling new teaching and research opportunities at the intersection of people and technology.”
Assistant Professor Danna Gurari's research interests span computer vision, crowdsourcing, applied machine learning and biomedical image analysis.
“Extracting information from images is an increasingly important challenge in our digital world, and Danna brings a unique mix of computational and crowdsourcing approaches to this problem,” Dillon said. “Her research is already recognized in the biomedical field for its importance, and she will complement our strengths in information discovery and retrieval.”
Gurari’s research has been recognized by the 2015 Researcher Excellence Award from the Boston University computer science department, among other accolades. Prior to joining the iSchool, she was a postdoctoral fellow in UT Austin’s computer science department. Gurari also worked five years in industry, developing software for satellite systems and building custom, high-performance, multi-camera image analysis systems for military, industrial, and academic applications.
“As an interdisciplinary researcher, I am delighted to join such a richly diverse and intelligent group of professors at the Information School,” she said. “I am excited to join the faculty and work with students on designing systems that accelerate the extraction of information from images and videos.”
Gurari will begin teaching in the Spring 2017 semester.
Acker, who began teaching in Fall 2016, said people’s constant connection to wireless networks is creating vast amounts of data that are transforming culture while raising questions about important issues, from government surveillance to the way we read and write on screens. Acker, the recipient of a grant from the federal Institute of Museum and Library Services, is also researching data literacy and digital preservation to support long-term cultural memory, as well as the environmental impact of preserving huge quantities of data.
“There’s a future where all of us will be creating data and metadata, whether we’re intentionally thinking about it or not, just by virtue of carrying a phone,” said Acker, whose award-winning dissertation was a history of the text message as a seminal development in modern, networked culture. “Every time there’s a big jump in technology that allows us to create new information, whether cuneiform tablets or Xerox machines, there’s a huge new change in the ways we remember and understand ourselves as a society. That’s what I’m really interested in right now.”
Acker joins the School of Information from the University of Pittsburgh’s iSchool, where she was lead faculty of the archives program. From 2006 to 2014 she worked as an archivist, librarian, and preservation consultant for libraries and archives in Southern California.
At UT Austin, the iSchool’s commitment to publishing and authorship and its broad curriculum were among factors that drew Acker to Texas, she said, as well as the strength of the school’s archives and conservation programs.
“Historically, the imperative to preserve is something that libraries and museums have been in control of,” Acker said. “As we move toward platforms like Dropbox, Gmail and Instagram, places where we’re constantly creating cultural memory together, how do we think of these new kinds of social media platforms as archives, and how do we make the case or lobby or describe them as such?”
Even as we create more information and more data than ever before, people are engaging with platforms and products that don’t have long-term storage provisions, Acker said. “There are all sorts of weird things we haven’t really grappled with yet,” she said. “It’s a very exciting time.”