Claudia Pearce, Engineering/IT Alum of the Year, on Big Data
(This article originally appeared in the UMBC Magazine)
Computers can crunch mind-boggling arrays of data. They can even win quiz shows. But are there more powerful applications of this analytical power yet to come? Claudia Pearce ’89 M.S., ’94 Ph.D., computer science, is the Senior Computer Science Authority at the National Security Agency (NSA). The winner of UMBC’s Alumna of the Year Award in Engineering and Information Technology in 2014, Pearce is diligently seeking the answer to that question.
Watson is IBM’s Deep Question Answering system. You might recall that when Watson was put to the test against human contestants on the television quiz show, Jeopardy!, the system successfully bested its competitors in providing questions to answers whose associated question was already known. (And won $1,000,000.)
But like any game, Jeopardy! has its rules – and its limits. Along with my colleagues and others in the field who study big data and predictive analytics, I’ve been wondering whether the techniques implemented in Watson could be used as a powerful knowledge discovery tool to find the questions to answers whose associated questions are unknown.
Subspecialties in the fields of computer science and statistics such as knowledge discovery, machine learning, data mining, and information retrieval are commonly applied in medicine and in the natural and physical sciences – and increasingly in the social sciences, advertising, and cybersecurity, too. (The resulting fields are often given names such as “computational biology” or “computational advertising.”)
And as the scope of computational practices has increased, the resources needed to perform them have shrunk tremendously. Ten years ago, massive computation was primarily confined to physics, astronomy, and biology, where petabytes of data were collected and analyzed using massive high-performance computing systems. The advent of cloud computing technologies – and their increasing public availability – now allows institutions, companies, and users to rent time for large-scale computation without the enormous costs of creating and maintaining supercomputers.
Additionally, programming and data storage paradigms have evolved to make use of the inherent parallelism in many domain applications. This trend has created new applications for computer science that provide individuals and organizations access to a plethora of online information in real time.
Real-time data sources spur not only social media, but online commerce, video streaming, and geolocation. Wireless technologies and smartphones put that information in the palm of our hands.
The power and speed of these technologies have aided the machine learning and data mining techniques at the heart of analytics, from retrieval of simple facts to trends and predictions. Advertising applications, for instance, analyze your click stream and cookies so that ads tailored to your interests appear as you browse in real time.
Yet the process of developing and maintaining analytics has its costs. First, there is labor. It usually requires teams of people to identify and solve problems in various domains. Analysts (who are usually experts in their subject) develop a collection of research questions in their discipline. They are teamed with statisticians, computer scientists, and others to develop and write programs to put the data in a usable form and create machine learning applications, tools, and algorithms. Together, the data and the programs form an analytic designed to answer a question in a given line of inquiry.
Labor isn’t the only cost. Depending on the application, analytics can be reused – but eventually need to be refreshed with new data. This is particularly true when the analytic is designed to be predictive. Real-time financial industry data, for example, requires that credit card fraud models be recreated at least annually. Computational advertising models must be refreshed weekly.
Outdated analytics become an artifact of the maintenance cycle. Reuse requires knowledge specific to the analytic, its data, and the question it was built to answer, and that knowledge is most likely unknown outside of a small development team. So how can we get beyond these limitations and make a simple, easy-to-use system that helps people get answers to questions in ways that go beyond Googling them?
Part of the answer may be found in a series of Beyond Watson workshops and other activities involving university, government, and industry partners, including one held at UMBC in February.
We’re looking at things like natural language, which is at the core of the ultimate computer-human interface. Think back to Star Trek, when Spock would say something like: “Computer, what is the probability the Klingons will attack the Enterprise in this sector of the galaxy?” The computer would often engage in a back-and-forth interaction with Spock, asking for more information or clarification before offering one or more scenarios (and probabilities supporting those scenarios) to aid the crew of the Enterprise in their next move. When we can ask a computer a question and then engage in a dialogue with it in this way, we can freely use computers to their best advantage. This sort of exchange allows us to home in on an answer (or answers) to our question, supported by knowing both how the information was derived and what level of confidence to place in it.
Envisioning such a model opens up new vistas. We may no longer be limited to retrieving existing answers from Wikipedia-like text, but could use all available data to elicit answers to non-obvious questions, or to queries that have never been asked. We might also dispense with arcane interfaces that are poorly matched to both the task at hand and the needs of users and consumers. Database management systems, for instance, have historically been developed to ask a few specific questions of data, but imposing such a structure on the data makes it very difficult to anticipate (or answer) questions we weren’t thinking of when we built the system.
Pushing our thinking – and technology – to generalize on the automation of building big data analytics could permit us to leverage existing tools and build on them instead of shelving them. And pulling these tools together in yet another set of big data analytics may even allow for the entire enterprise of analytics (newly created systems as well as existing ones) to help us advance particular knowledge.
Your cell phone has a plethora of apps (collected in an “app store”) to help you make the best use of it. Creating a similar “analytics store” for our Question Answering systems will accelerate automation of these processes and encourage crowdsourcing in analytics development. It’s a vision of the future for which the technology already exists. By specifying and clarifying what we want big data analytics to accomplish, we can start to build that future now.
Note: The views and opinions expressed are those of Claudia Pearce and do not reflect those of NSA/CSS.
Posted: August 16, 2015, 3:57 PM