
URL: https://thenewstack.io/mits-new-ai-data-extraction-system-teaches-surfing-web/

MIT's New AI Data Extraction System Teaches Itself by Surfing the Web - The New Stack



MIT’s New AI Data Extraction System Teaches Itself by Surfing the Web

Jan 11th, 2017 9:40am by Kimberley Mok

We live in an age with a vast overabundance of data available on the web. The problem is that sifting through all of it to find and make sense of whatever is relevant is incredibly time-consuming. That may soon become easier: Massachusetts Institute of Technology researchers recently published a paper introducing a new artificial intelligence system that can learn, on its own, to extract useful information from online sources.

Recently presented at the Association for Computational Linguistics’ Conference on Empirical Methods in Natural Language Processing in Austin, the researchers’ paper describes a new information extraction system that automatically extracts structured information from unstructured machine-readable documents. Put simply, the program does what humans do well: when faced with a gap in information or something we don’t understand, we search for another document that adds to our understanding.

“In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,” said professor Regina Barzilay, senior author of the new paper. “That’s very different from what you or I would do. When you’re reading an article that you can’t understand, you’re going to go on the web and find one that you can understand.”

AI Fills Information Gaps by Itself

That’s what distinguishes this new AI from its predecessors. Typically, machine learning models work within narrowly defined parameters and must be ‘taught’ with many training examples before they can tackle a problem with some measure of success. This new model, however, was trained on very little data and then set loose to fill the gaps on its own.

As in other models, the AI assigns a “confidence score” to each of its data classifications, indicating the statistical likelihood that the classification is correct given the patterns learned from the training data. In contrast to previous systems, this new model automatically performs a web search for more relevant information if the confidence score doesn’t meet a certain threshold. It then extracts pertinent data from the new texts and integrates it with its previous extractions. If the confidence score is still too low, the cycle begins again.
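
The cycle described above can be sketched in a few lines of Python. This is only an illustration, not the researchers’ implementation: the `extract` and `search_web` callables, the 0.8 threshold, the round limit, and the keep-the-higher-confidence merge rule are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Extraction:
    value: str
    confidence: float  # estimated probability that the value is correct


Fields = Dict[str, Extraction]


def merge(current: Fields, new: Fields) -> Fields:
    """Keep, for each field, whichever extraction has the higher confidence.

    Assumption: a simple stand-in for the learned merging the article describes.
    """
    merged = dict(current)
    for name, candidate in new.items():
        if name not in merged or candidate.confidence > merged[name].confidence:
            merged[name] = candidate
    return merged


def extract_with_search(
    article: str,
    extract: Callable[[str], Fields],         # hypothetical extractor
    search_web: Callable[[str], List[str]],   # hypothetical search function
    threshold: float = 0.8,                   # assumed confidence cutoff
    max_rounds: int = 3,                      # assumed cap on search cycles
) -> Fields:
    fields = extract(article)
    for _ in range(max_rounds):
        # Stop as soon as every field is confident enough.
        if all(e.confidence >= threshold for e in fields.values()):
            break
        # Otherwise fetch related documents and fold in their extractions.
        for doc in search_web(article):
            fields = merge(fields, extract(doc))
    return fields
```

In the system the article describes, the decision of what to search for and how to merge is learned rather than hard-coded as it is here.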

“We used a technique called reinforcement learning, whereby a system learns through the notion of reward,” graduate student Karthik Narasimhan, one of the paper’s co-authors, explained to Digital Trends. “Because there is a lot of uncertainty in the data being merged, particularly where there is contrasting information, we give it rewards based on the accuracy of the data extraction. By performing this action on the training data we provided, the system learns to be able to merge different predictions in an optimal manner, so we can get the accurate answers we seek.”

Analyzing Shootings and Contaminated Food

The researchers employed what is called a deep Q-network (DQN), which is “trained to optimize a reward function that reflects extraction accuracy while penalizing extra effort.”
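
To make that reward idea concrete, here is a rough Python sketch. It assumes the reward is the change in field-level accuracy minus a fixed per-step cost, and it uses a tabular Q-learning update for readability; a DQN replaces the table with a neural network. The function names, the penalty value, and the learning-rate and discount settings are illustrative assumptions, not figures from the paper.

```python
from typing import Dict, Hashable, List, Tuple


def accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return correct / len(gold)


def step_reward(
    before: Dict[str, str],
    after: Dict[str, str],
    gold: Dict[str, str],
    step_penalty: float = 0.05,  # assumed cost of each extra search step
) -> float:
    """Reward = gain in accuracy from this step, minus a fixed cost per step."""
    return accuracy(after, gold) - accuracy(before, gold) - step_penalty


QTable = Dict[Tuple[Hashable, str], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: str,
    reward: float,
    next_state: Hashable,
    actions: List[str],
    alpha: float = 0.1,  # assumed learning rate
    gamma: float = 0.9,  # assumed discount factor
) -> None:
    """One tabular Q-learning update, modifying the table in place."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

Under these assumptions, a search that fixes a wrong field earns a positive reward, while a search that changes nothing costs the step penalty, which discourages needless effort.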

They tested the information extraction system separately on two tasks. The first was analyzing a collection of data on mass shootings in the United States (macabre, we know, but useful if one is studying the effects of gun control laws), where the system had to extract the name of the shooter, location, the number of wounded and the number of fatalities. The second task involved going through a set of data on food contamination events to extract information on food type, contaminant type and location. In both cases, the team found that the new system outperformed conventionally trained information extractors by about 10 percent.
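
The “structured information” targeted in the two tasks can be pictured as simple records. The Python sketch below is only an illustration: the class and field names are made up for this example and mirror the fields listed above, and `missing_fields` is a hypothetical helper showing which slots an extractor would still need to fill.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ShootingRecord:
    shooter_name: Optional[str] = None
    location: Optional[str] = None
    num_wounded: Optional[int] = None
    num_killed: Optional[int] = None


@dataclass
class FoodContaminationRecord:
    food_type: Optional[str] = None
    contaminant: Optional[str] = None
    location: Optional[str] = None


def missing_fields(record) -> List[str]:
    """Names of fields that have not been filled in yet, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]
```

A record with empty slots is the kind of gap that would prompt the system to look for additional articles on the same event.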

Image caption: A sample news article on one shooting case. It contains both the shooter’s name and the number of fatalities, but extracting either piece of information would require complex extraction tools.

Two other articles on the same shooting case, retrieved by the information extraction system. The first article gives the number of people killed, while the second article identifies the shooter in an easily extractable form.

The new system could be a boon for research tasks that previously required tedious manual effort. Not only would a system like this save time, it could also save lives: the researchers foresee healthcare providers using it as a tool for aggregating patient histories under a more unified structure, which would improve the quality of care that a patient receives.

In the greater scheme of things, the system is one step toward building what’s called artificial general intelligence, capable of mastering any number of tasks in the way a human might, rather than being an expert in only one domain.

Featured image: Esther’s Follies, Austin, Texas. Other images: MIT.
