Introducing Retrieve API: the best-in-class autonomous web information retrieval API
After we launched the Agent API, we heard from many developers that they needed natural language-based web understanding and data extraction to complement the agent's action capabilities for autonomous web browsing. This became especially apparent as we talked to developers building agents for use cases that require web research and structured data extraction.
The need also became increasingly evident to us internally at MultiOn: a read capability - web understanding - that works in tandem with our write capability - web actions. When we researched the market for a viable way to extract structured data from web pages, we found clear demand for a performant, easy-to-use API that could understand websites universally using natural language commands. So we decided to build it ourselves: introducing the Retrieve API – the best-in-class NL web understanding and information retrieval API.
Unlike traditional approaches in search engines and APIs, our Retrieve API actively crawls and parses web pages in real-time to fetch structured information with market-leading accuracy and speed. Today, you can use the Retrieve API in tandem with the rest of our Agent API to create fully autonomous web agents that can navigate pages and scrape information in 3 lines of code. You can also leverage a multi-agent approach to retrieve information from multiple sources in parallel. See an easy example of how you can build with the Agent API here.
Use cases
A startup developing an AI-powered personal shopping assistant can now use the Retrieve API to scan e-commerce sites, compare prices, and gather product reviews, providing users with personalized recommendations.
Another startup creating an automated lead generation tool might employ the Retrieve API to crawl company websites, social media profiles, and business directories, compiling detailed prospect lists for sales teams.
A web agent startup focused on content creation can utilize the API to gather trending topics, relevant statistics, and reputable sources across the internet, assisting in the generation of data-driven articles and reports.
Benchmarks
To measure the performance of our Retrieve API, we compared it with the best existing LLM-based data extraction APIs on the market, Firecrawl and Induced.AI, and the most popular open-source library for LLM-based data scraping, ScrapeGraph. For each query, we specified the same natural language prompt and output field schema across all models. For ScrapeGraph, we used SmartScraperGraph with GPT-4o. We couldn't test providers like reworkd.ai because access is limited to a private waitlist.
We created a test dataset of 50 static websites hosted on the Internet Archive across 8 categories: e-commerce, social, news, food, travel, finance, technology, and information. For each website, we built 5 scenarios, measuring the agent's ability to accurately retrieve data fields and links from the web page.
Websites we used
E-commerce: Amazon, Craigslist, Ebay, Etsy
Social: LinkedIn, Pinterest, Reddit
News: Forbes, Wikinews, Yahoo News
Food: Doordash, Opentable, Ubereats, Yelp
Travel: Airbnb
Finance: Coinbase
Technology: Hyperstack, Lambda
Information: Govinfo, Idaho.gov, Weather.com
Example query
Input prompt: “Get the reviews for this laptop. Also give the sentiment of each review, either positive, negative, or other. Give the date as text as written, either the date or number of days ago it was posted.”
Output fields: ["title", "author", "date", "text", "stars", "sentiment"]
URL: Amazon HP Laptop Listing
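For illustration, here is roughly how such a query could be issued with the Python SDK. This is a sketch, not a verbatim excerpt: the listing URL is a placeholder, and the exact client surface (import path, response shape) is assumed.

```python
from multion.client import MultiOn  # assumed SDK import path

client = MultiOn(api_key="YOUR_API_KEY")

response = client.retrieve(
    url="<amazon-hp-laptop-listing-url>",  # placeholder for the Amazon HP laptop listing
    cmd=(
        "Get the reviews for this laptop. Also give the sentiment of each review, "
        "either positive, negative, or other. Give the date as text as written, "
        "either the date or number of days ago it was posted."
    ),
    fields=["title", "author", "date", "text", "stars", "sentiment"],
)

# Assuming data comes back as a list of dicts keyed by the requested fields
for review in response.data:
    print(review["title"], review["sentiment"])
```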
Retrieval evaluation
For evaluating retrieval, we want to consider both the accuracy of the retrieved items and the completeness of the retrieval results. The accuracy of the retrieved items is measured by precision, the fraction of retrieved items that are on the page, while completeness is measured by recall, the fraction of requested/available items that are retrieved. Low precision means that the retrieval endpoint is hallucinating or giving inaccurate information, while low recall means that the endpoint is missing information. The F-score is a measure of the overall performance that represents a balance of both of these factors.
The retrieval task is defined by retrieving a collection of items from a website, each of which has a few fields containing information. We measure the performance at both the item level and field level. At the item level, we evaluate whether complete and correct items are returned, while the field level is a measure that doesn’t penalize incomplete items, but measures overall information retrieval performance. Additional implementation details are provided at the end of the post.
We break down the results across the different categories of websites in our benchmark and see that our endpoint has the best performance in each category, at both the item level and the field level:
Code example
To use the new Retrieve API, check out the examples below and our MultiOn Cookbook with curated recipes to get started with the agent:
Scrape H&M (code snippet below)
Here is a simple Python code snippet that combines our Agent with Retrieve API to scrape the H&M website catalog. The code is also available in TypeScript here.
Step 1: Install pip package
Install the multion package with pip:
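Assuming the SDK is published on PyPI as multion, this is a single command:

```bash
pip install multion
```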
Step 2: Initialize MultiOn
Import the MultiOn library and initialize a client with an API key (get yours here):
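A minimal sketch of the initialization, assuming the client is exposed as MultiOn under multion.client and accepts an api_key argument:

```python
from multion.client import MultiOn

# Initialize the client with your MultiOn API key
client = MultiOn(api_key="YOUR_API_KEY")
```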
Step 3: Scrape the first page
To scrape the first page of the H&M catalog, we can simply call retrieve. Because H&M dynamically loads the images as the user scrolls down the page, we will use renderJs to ensure image links are included and scrollToBottom to scroll down the page.
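A sketch of that first call; the H&M catalog URL, the field names, and the snake_case parameter names (render_js, scroll_to_bottom) are assumptions based on the description above:

```python
cmd = "Get the name, price, and image URL of each product in the catalog."
fields = ["name", "price", "image_url"]

retrieve_response = client.retrieve(
    url="https://www2.hm.com/en_us/women/products/view-all.html",  # assumed catalog URL
    cmd=cmd,
    fields=fields,
    render_js=True,         # render JavaScript so dynamically loaded images are present
    scroll_to_bottom=True,  # scroll down so lazily loaded products are captured
)

print(retrieve_response.data)  # list of dicts, one per product
```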
Step 4: Scrape multiple pages autonomously
To scrape multiple pages autonomously, we can use retrieve with step to navigate to the next page. To do this, we must first create a session.
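Creating a session might look like the following; the sessions.create call and the session_id field on the response are assumptions about the SDK surface:

```python
create_response = client.sessions.create(
    url="https://www2.hm.com/en_us/women/products/view-all.html"  # assumed catalog URL
)
session_id = create_response.session_id
```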
Then, we can create a while loop that will keep running until the last page. At each iteration, the agent will retrieve data and step to navigate to the next page.
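A sketch of that loop, assuming retrieve accepts a session_id and that sessions.step returns a status of "CONTINUE" while there are more pages to visit:

```python
all_products = []
has_more_pages = True

while has_more_pages:
    # Retrieve products from the page the session is currently on
    retrieve_response = client.retrieve(
        session_id=session_id,
        cmd=cmd,
        fields=fields,
        render_js=True,
        scroll_to_bottom=True,
    )
    all_products.extend(retrieve_response.data)

    # Ask the agent to navigate onward; stop once it reports it is done
    step_response = client.sessions.step(
        session_id=session_id,
        cmd="Go to the next page of the catalog. If this is the last page, stop.",
    )
    has_more_pages = step_response.status == "CONTINUE"  # assumed status value

client.sessions.close(session_id=session_id)
print(f"Scraped {len(all_products)} products")
```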
Extra: Scrape multiple pages in parallel
To massively speed up the scraping process, we can call retrieve for each page simultaneously. This works for H&M because each catalog page has its own numbered URL.
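One way to parallelize, assuming each page is addressable with a page query parameter (a hypothetical URL pattern) and using a thread pool for the concurrent retrieve calls:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_page(page_number: int) -> list:
    # Each page has its own numbered URL, so it can be retrieved independently
    response = client.retrieve(
        url=f"https://www2.hm.com/en_us/women/products/view-all.html?page={page_number}",
        cmd=cmd,
        fields=fields,
        render_js=True,
        scroll_to_bottom=True,
    )
    return response.data

# Retrieve the first five pages concurrently
with ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(scrape_page, range(1, 6)))

all_products = [product for page in pages for product in page]
print(f"Scraped {len(all_products)} products in parallel")
```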
Ready to get started with the Retrieve API today and learn more about use cases tailored to you? Let's chat 🚀
Appendix
Benchmark creation
To get the ground truth results, we tried retrieving results using a variety of open and closed-source models. We selected the best-performing model across a handful of test websites. We used it to generate 10 retrieval results per site in the dataset, had another model reconcile the results into a single ground truth result, and then manually inspected the results for errors.
Metric evaluation
Field-level recall, precision, f-score: For a particular field to be successfully retrieved, it needs to have the correct value and be part of the correct item. We checked whether it was part of the correct item by matching items on a name field such as 'name' or 'title'. To check whether a field had the correct value, we evaluated fields by field type: fuzzy matching for text, exact but format-independent comparison for prices and dates, exact matches for links, and so on. We calculated the precision, recall, and f-score as follows:
TruePositive_field: The field’s item name matches a ground truth item and the field value matches that item’s value.
FalsePositive_field: The field value is not empty, and either the item's name doesn't match a ground truth item or the field's value doesn't match the matching item's value.
FalseNegative_field: There is no matching retrieved field for a ground truth field. Either there is no matching item or the retrieved item’s field value is missing, empty, or incorrect.
Item level recall, precision, f-score: To evaluate item-level metrics, we checked whether a retrieved item matched a ground truth item by all fields, leading to a demanding metric that more heavily penalized partial or incomplete results. To calculate this, we did the following:
TruePositive_item : A retrieved item matches a ground truth item, across each of the item’s fields.
FalsePositive_item: A retrieved item doesn't match any ground truth item across all fields.
FalseNegative_item: There is a ground truth item with no matching retrieved item with all matching fields.
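With these counts, both the field-level and item-level metrics follow the standard definitions (taking the f-score here to be the balanced F1):

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$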
Precision vs. recall
Is the better performance of our endpoint due to precision or recall? We plot the field-level results below and see that it is both: