
Introducing Retrieve API: the best-in-class autonomous web information retrieval API


After we launched the Agent API, many developers told us they needed natural language-based web understanding and data extraction to complement the agent's action capabilities for autonomous web browsing. This became especially apparent as we talked to developers building agents for use cases that require web research and structured data extraction.

Internally at MultiOn, it also became increasingly evident that we needed a read capability (web understanding) that works in tandem with our write capability (web actions). When we researched the market for a viable way to extract structured data from web pages, we found a clear demand for a performant, easy-to-use API that could understand any website using natural language commands. So we decided to build it ourselves. Introducing the Retrieve API: the best-in-class NL web understanding and information retrieval API.

Retrieve API scraping Urban Outfitters autonomously in <3 minutes

Unlike traditional approaches in search engines and APIs, our Retrieve API actively crawls and parses web pages in real-time to fetch structured information with market-leading accuracy and speed. Today, you can use the Retrieve API in tandem with the rest of our Agent API to create fully autonomous web agents that can navigate pages and scrape information in 3 lines of code. You can also leverage a multi-agent approach to retrieve information from multiple sources in parallel. See an easy example of how you can build with the Agent API here.

Use cases

A startup developing an AI-powered personal shopping assistant can now use the Retrieve API to scan e-commerce sites, compare prices, and gather product reviews, providing users with personalized recommendations.

Another startup creating an automated lead generation tool might employ the Retrieve API to crawl company websites, social media profiles, and business directories, compiling detailed prospect lists for sales teams.

A web agent startup focused on content creation can utilize the API to gather trending topics, relevant statistics, and reputable sources across the internet, assisting in the generation of data-driven articles and reports.


To measure the performance of our Retrieve API, we compared it with the best existing LLM-based data extraction APIs on the market, Firecrawl and Induced.AI, and with the most popular open-source library for LLM data scraping, ScrapeGraph. For each query, we specified the same natural language prompt and output field schema across all models. For ScrapeGraph, we used SmartScraperGraph with GPT-4o. We couldn’t try some other providers because they are available only through a private waitlist.


We created a test dataset with 50 static websites hosted on Internet Archive across 8 categories: e-commerce, social, news, food, travel, finance, media, technology, and information. For each website, we built 5 scenarios, measuring the agent’s ability to accurately retrieve data fields and links from the web page.

Websites we used

  • E-commerce: Amazon, Craigslist, eBay, Etsy

  • Social: LinkedIn, Pinterest, Reddit

  • News: Forbes, Wikinews, Yahoo News

  • Food: DoorDash, OpenTable, Uber Eats, Yelp

  • Travel: Airbnb

  • Finance: Coinbase

  • Technology: Hyperstack, Lambda

  • Information: Govinfo

Example query

  • Input prompt: “Get the reviews for this laptop. Also give the sentiment of each review, either positive, negative, or other. Give the date as text as written, either the date or number of days ago it was posted.”

  • Output fields: ["title", "author", "date", "text", "stars", "sentiment"]

  • URL: Amazon HP Laptop Listing

Retrieval evaluation

For evaluating retrieval, we want to consider both the accuracy of the retrieved items and the completeness of the retrieval results. The accuracy of the retrieved items is measured by precision, the fraction of retrieved items that are on the page, while completeness is measured by recall, the fraction of requested/available items that are retrieved. Low precision means that the retrieval endpoint is hallucinating or giving inaccurate information, while low recall means that the endpoint is missing information. The F-score is a measure of the overall performance that represents a balance of both of these factors.
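As a minimal sketch of how these three numbers relate, the helper below computes them from raw counts. It assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the post does not state explicitly; the function name and the zero-division convention are our own choices.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumption: "F-score" means F1, the harmonic mean of the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: 10 items returned, 8 of them really on the page (2 hallucinated),
# while the page lists 16 items in total, so 8 were missed.
print(precision_recall_f1(tp=8, fp=2, fn=8))
```

In this example precision is high (few hallucinations) while recall is only one half (many items missed), and the F-score lands between the two.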

Metrics Comparison

The retrieval task is defined by retrieving a collection of items from a website, each of which has a few fields containing information. We measure the performance at both the item level and field level. At the item level, we evaluate whether complete and correct items are returned, while the field level is a measure that doesn’t penalize incomplete items, but measures overall information retrieval performance. Additional implementation details are provided at the end of the post.

We break down the results across different categories of websites in our benchmark and see that our endpoint has the best performance in each category, at both the item level and the field level:

Item Categorized
Field Categorized

Code example

To use the new Retrieve API, check out the examples below and our MultiOn Cookbook, which has curated recipes to get started with the agent.

Here is a simple Python code snippet that combines our Agent with Retrieve API to scrape the H&M website catalog. The code is also available in TypeScript here.

Step 1: Install pip package

Install the multion package with pip by running pip install multion.

Step 2: Initialize MultiOn

Import the MultiOn library and initialize a client with an API key (get yours here):
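A minimal sketch of this step might look like the following. It assumes the Python SDK exposes a MultiOn class in multion.client that accepts an api_key argument; the MULTION_API_KEY environment variable and both helper names are our own illustrative choices, not part of the SDK.

```python
import os


def load_api_key(env_var: str = "MULTION_API_KEY") -> str:
    """Read the API key from an environment variable.

    The variable name is a hypothetical convention chosen for this sketch.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your MultiOn API key first.")
    return api_key


def make_client(api_key: str):
    """Build a MultiOn client (requires the multion package to be installed)."""
    # Assumption: the SDK provides multion.client.MultiOn(api_key=...).
    # The import is done lazily so load_api_key works without the package.
    from multion.client import MultiOn

    return MultiOn(api_key=api_key)


# Typical usage:
# client = make_client(load_api_key())
```

Reading the key from the environment keeps it out of source control; passing it as a string literal would work just as well for a quick experiment.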

Step 3: Scrape the first page

To scrape the first page of the H&M catalog, we can simply call retrieve. Because H&M dynamically loads the images as the user scrolls down the page, we will use renderJs to ensure image links are included and scrollToBottom to scroll down the page.
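The call could be sketched as below. This is not the original snippet: the catalog URL, prompt and field names are hypothetical, and we assume the Python SDK spells the two options as render_js and scroll_to_bottom (the snake_case forms of renderJs and scrollToBottom) and returns the items in a data attribute.

```python
from typing import Any

# Hypothetical values for illustration; the post does not give the exact
# URL, prompt or output fields.
CATALOG_URL = "https://www2.hm.com/en_us/men/products/view-all.html"
FIELDS = ["name", "price", "colors", "purchase_url", "image_url"]
PROMPT = "Get all items and their name, price, colors, purchase url, and image url."


def scrape_first_page(client: Any, url: str = CATALOG_URL) -> list:
    """Retrieve one catalog page and return its items."""
    response = client.retrieve(
        cmd=PROMPT,
        url=url,
        fields=FIELDS,
        # Assumed keyword names for the renderJs / scrollToBottom options.
        render_js=True,          # render JavaScript so image links are included
        scroll_to_bottom=True,   # scroll so lazily loaded items appear
    )
    # Assumption: the response exposes the extracted items as .data.
    return response.data
```

Here client is the object created in Step 2. Taking it as a parameter keeps the function easy to exercise with a stub before spending API calls.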

Step 4: Scrape multiple pages autonomously

To scrape multiple pages autonomously, we can use retrieve with step to navigate to the next page. To do this, we must first create a session.

Then, we can create a while loop that will keep running until the last page. At each iteration, the agent will retrieve data and step to navigate to the next page.
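One way to sketch the session and the loop together is shown below. Several details are assumptions rather than confirmed SDK behavior: that sessions are managed through client.sessions.create, step and close, that the create response has a session_id, and that the step response carries a message we can inspect. The "last page" stop phrase and the max_pages safety cap are our own additions.

```python
from typing import Any

# Hypothetical prompt, fields and navigation command for illustration.
FIELDS = ["name", "price", "colors", "purchase_url", "image_url"]
PROMPT = "Get all items and their name, price, colors, purchase url, and image url."
NEXT_PAGE_CMD = (
    "Click on the next page button. "
    "If there is no next page, reply with 'last page'."
)


def scrape_all_pages(client: Any, start_url: str, max_pages: int = 50) -> list:
    """Retrieve items page by page until the agent reports the last page."""
    # Assumption: sessions.create returns an object with a session_id.
    session_id = client.sessions.create(url=start_url).session_id
    items: list = []
    pages_seen = 0
    has_more = True
    try:
        while has_more and pages_seen < max_pages:
            page = client.retrieve(
                session_id=session_id,
                cmd=PROMPT,
                fields=FIELDS,
                render_js=True,
                scroll_to_bottom=True,
            )
            items.extend(page.data)
            pages_seen += 1
            # Ask the agent to navigate; its reply tells us whether to go on.
            step = client.sessions.step(session_id=session_id, cmd=NEXT_PAGE_CMD)
            has_more = "last page" not in (step.message or "").lower()
    finally:
        # Always release the remote session, even if a call fails.
        client.sessions.close(session_id=session_id)
    return items
```

The cap on pages guards against a loop that never ends if the agent's reply never contains the stop phrase.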

Extra: Scrape multiple pages in parallel

To massively speed up the scraping process, we can call retrieve for each page simultaneously. This works for H&M because the URL is numbered for each page.
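A rough sketch of the parallel variant follows, using a thread pool from the Python standard library. The ?page=N URL pattern, the prompt and the field names are hypothetical, the retrieve keyword names are assumed as in the earlier steps, and we also assume the client can be shared safely across threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any

# Hypothetical prompt and fields for illustration.
FIELDS = ["name", "price", "colors", "purchase_url", "image_url"]
PROMPT = "Get all items and their name, price, colors, purchase url, and image url."


def page_url(base_url: str, page: int) -> str:
    """Build the URL of a numbered catalog page.

    Assumption: pages are addressed with a ?page=N query parameter.
    """
    return f"{base_url}?page={page}"


def scrape_pages_in_parallel(
    client: Any, base_url: str, num_pages: int, max_workers: int = 8
) -> list:
    """Call retrieve once per page concurrently and merge the results."""

    def scrape(page: int) -> list:
        response = client.retrieve(
            cmd=PROMPT,
            url=page_url(base_url, page),
            fields=FIELDS,
            render_js=True,
            scroll_to_bottom=True,
        )
        return response.data

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in page order, regardless of completion order.
        per_page = list(pool.map(scrape, range(1, num_pages + 1)))
    return [item for page_items in per_page for item in page_items]
```

No session is needed here because each call receives its own URL. The trade-off is that the number of pages has to be known, or estimated, up front.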

Ready to get started with the Retrieve API today and learn more about use cases tailored to you? Let's chat 🚀


Benchmark creation

To get the ground truth results, we tried retrieving results using a variety of open and closed-source models. We selected the best-performing model across a handful of test websites. We used it to generate 10 retrieval results per site in the dataset, had another model reconcile the results into a single ground truth result, and then manually inspected the results for errors.

Metric evaluation

Field-level recall, precision, F-score: For a particular field to be successfully retrieved, it needs to have the correct value and be part of the correct item. We checked whether or not it was part of the correct item by matching items by a name field like 'name' or 'title'. To check whether a field had the correct value, we evaluated each field according to its type: fuzzy matching for text, exact but format-independent evaluation for prices and dates, exact matches for links, etc. We calculated the precision, recall and F-score as follows:

  • TruePositive_field: The field’s item name matches a ground truth item and the field value matches that item’s value.

  • FalsePositive_field: The field value is not empty, and the item’s name doesn’t match or the field’s value doesn’t match the matching item’s value.

  • FalseNegative_field: There is no matching retrieved field for a ground truth field. Either there is no matching item or the retrieved item’s field value is missing, empty, or incorrect.

Field Level Calculations
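The three field-level definitions above can be sketched as a counting routine. This is a simplified illustration, not our evaluation code: plain equality stands in for the type-specific matching (fuzzy text, format-independent prices and dates), items are assumed to have unique names, and the function and parameter names are hypothetical.

```python
def field_level_counts(retrieved, truth, fields, key="name"):
    """Count field-level (true positives, false positives, false negatives).

    retrieved, truth: lists of dicts; fields: field names to score;
    key: the name-like field used to pair retrieved items with ground truth.
    """
    def is_empty(value):
        return value is None or value == ""

    truth_by_key = {item[key]: item for item in truth}
    retrieved_by_key = {item.get(key): item for item in retrieved}

    tp = fp = 0
    for item in retrieved:
        match = truth_by_key.get(item.get(key))
        for field in fields:
            value = item.get(field)
            if is_empty(value):
                continue  # empty values are never counted as positives
            if match is not None and value == match.get(field):
                tp += 1   # right item and right value
            else:
                fp += 1   # unknown item, or wrong value on a known item

    fn = 0
    for gt_item in truth:
        got = retrieved_by_key.get(gt_item[key])
        for field in fields:
            if is_empty(gt_item.get(field)):
                continue
            # Missing item, or a missing, empty or incorrect field value.
            if got is None or got.get(field) != gt_item[field]:
                fn += 1
    return tp, fp, fn
```

Note that an incorrect value on a correctly matched item counts as both a false positive and a false negative, which follows from the definitions.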

Item-level recall, precision, F-score: To evaluate item-level metrics, we checked whether a retrieved item matched a ground truth item on all fields, which yields a demanding metric that more heavily penalizes partial or incomplete results. To calculate this, we did the following:

  • TruePositive_item: A retrieved item matches a ground truth item across each of the item’s fields.

  • FalsePositive_item: A retrieved item doesn’t match any ground truth item on all fields.

  • FalseNegative_item: A ground truth item has no retrieved item that matches it on all fields.

Item Level Calculations
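The item-level counts can be sketched the same way. Again this is a simplified illustration with hypothetical names, where plain equality on every field stands in for the type-specific matching described earlier.

```python
def item_level_counts(retrieved, truth, fields):
    """Count item-level (true positives, false positives, false negatives).

    Two items match only if they agree on every field in `fields`.
    """
    def same_item(a, b):
        return all(a.get(field) == b.get(field) for field in fields)

    # A retrieved item is a true positive if some ground truth item matches it.
    tp = sum(1 for r in retrieved if any(same_item(r, g) for g in truth))
    fp = len(retrieved) - tp
    # A ground truth item is a false negative if no retrieved item matches it.
    fn = sum(1 for g in truth if not any(same_item(r, g) for r in retrieved))
    return tp, fp, fn


truth = [{"name": "A", "price": "10"}, {"name": "B", "price": "20"}]
retrieved = [
    {"name": "A", "price": "10"},
    {"name": "B", "price": "25"},  # one wrong field sinks the whole item
    {"name": "C", "price": "5"},
]
print(item_level_counts(retrieved, truth, ["name", "price"]))
```

In this example item B has one wrong field, so it counts as a false positive and leaves its ground truth item unmatched. This is why the item-level scores are lower than the field-level ones for partially correct results.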

Precision vs. recall

Is the better performance of our endpoint due to precision or recall? We plot the field-level results below and see that it is both:

Recall Categorized
Precision Categorized