Generative AI Machine Learning

How ChatGPT Search Results Work

Dan Petrovic and Gemini Pro 01/08/2024

When a user requests information, ChatGPT can utilize a browser tool to fetch real-time data from the web. The process starts by either issuing a search query through a search engine (Bing) using the search command, or browsing one or more sites directly.

Search Function

When retrieving information, the search(query: str, recency_days: int) function often uses Bing to obtain relevant search results. This automated process leverages Bing’s ranking algorithms to return results based on the query provided.

The raw results returned include a list of titles, brief descriptions, and URLs of web pages relevant to the query. Each result is associated with an “ID” that corresponds to its position in the list returned by the search engine. These results provide key information about each webpage, including:

The title of the page.
A brief snippet or summary of the content.
The URL.
An index or “ID” number used to refer to specific results.

Several of these results are then chosen (using their IDs) to retrieve detailed content from those specific pages.

After this step, the content is scraped via mclick(ids: list[str]) to generate a summary or answer to the query.

The IDs correspond to the order in which these results were retrieved and reviewed, not necessarily the order in which they appeared in the initial search list. If needed, the search can be repeated or additional pages retrieved.

Content Scraping

ChatGPT selects and scrapes content from a diverse set of webpages using the mclick command. It then synthesizes this information into a coherent response, citing the sources.

If the first set of results is unsatisfactory, the search can be refined and repeated to ensure the most relevant and high-quality information is provided. The website won’t know where the click comes from due to rel=”noreferrer” implementation.

Direct Scraping

In some cases the model may decide to scrape the web resource directly or due to an explicit request by the user.

For example:

However in some instances, and this is super-interesting, GPT will decide to search Bing’s index instead of direct page visit. I tested this after removing some models from the public view in the settings and requested GPT lists all my models.

It listed the models that were no longer on the page. Clearly using cached version of the page. A nice little detail I didn’t know about.

Direct Scrape Triggers

In certain scenarios, it becomes necessary to manually select a set of websites for information extraction. This approach is particularly useful for ensuring the accuracy and relevance of the information gathered, especially when dealing with specialized topics or when high-quality, vetted information is required.

Trusted Sources: In cases where manual selection is applied, priority is given to sources recognized for their expertise and credibility within the relevant field. These might include academic journals, official industry publications, or websites managed by subject matter experts.

Diverse Perspectives: To provide a well-rounded perspective on the topic, information is often gathered from a variety of publication types, such as news outlets, blogs, and industry reports, ensuring that multiple viewpoints are considered.

Content Scraping: After selecting the appropriate sites, the mclick(ids: list[str]) command is used to scrape content from specific pages. This allows for the extraction of necessary details, which are then synthesized into a coherent and informative response.

Use Cases for Manual Selection

Specialized Topics: For highly technical or niche subjects, where general search engine results might not provide sufficient depth.

Current Events with Low Coverage: When a recent event hasn’t been widely covered yet, or if early coverage is inconsistent, selecting a few reliable sites that have addressed the topic early ensures accurate reporting.

Quality Control: To avoid low-quality or misleading information that might appear in standard search results, manually selecting trusted sources ensures the reliability and accuracy of the information provided.

There isn’t a pre-programmed list of specific websites that can be accessed directly without using a search engine like Bing. However, several categories of websites are generally considered reliable and trustworthy across various fields. These include:

General Information and Reference

Wikipedia: While user-edited, it’s typically reliable for general information, especially when cross-referenced with citations.
Encyclopaedia Britannica: A respected source for general knowledge.

News and Current Events

BBC News: Known for its balanced and comprehensive reporting.
The New York Times: Highly regarded for its in-depth news coverage.
Reuters: Offers unbiased and factual news reports.

Science and Research

PubMed: A resource for life sciences and biomedical information.
Google Scholar: A search engine for scholarly literature across various disciplines.
NASA: Reliable for information on space exploration and scientific research.

Technology and Industry

Wired: Covers technology, science, and culture with in-depth analysis.
TechCrunch: Provides news on startups and technology.
IEEE Xplore: A digital library for research papers in engineering and technology.

Health and Medicine

Mayo Clinic: Offers comprehensive medical information and guidelines.
WebMD: Provides detailed health information, including symptoms and treatment options.
NIH (National Institutes of Health): A primary resource for health research and information.

Finance and Economy

The Wall Street Journal: A leading source for financial news and analysis.
Bloomberg: Offers global business and financial news, data, and analysis.
Investopedia: Provides comprehensive content on investing, finance, and economics.

Education and Academia

JSTOR: A digital library for academic journals, books, and primary sources.
EDU websites: Official university websites (.edu) often provide reliable academic and educational content.

These sources are considered trustworthy due to their editorial standards, rigorous fact-checking, and reputation in their respective fields. However, when specific or highly niche information is required, a targeted search using tools like Bing or direct access to specialized databases may still be necessary.

Error Handling in `mclick`

The system includes error-handling mechanisms. If a page is unresponsive or fails to load during the mclick process, the tool automatically skips the page and attempts to retrieve information from other selected sources. This minimizes disruptions and ensures that the user receives relevant data even if some sources are temporarily inaccessible.

For example:

try:
    mclick([0, 2, 4])
except PageLoadError:
    handle_error("Page failed to load. Trying alternative sources...")
    mclick([1, 3, 5])

Real-Time Search Adjustments

The search process can be refined dynamically if initial results are not satisfactory. Adjustments might involve tweaking keywords or using additional filters.

For example:

search_results = search("Dejan SEO experiments 2024", recency_days=365)
if not satisfactory(search_results):
    search_results = search("Advanced Dejan SEO tests 2024", recency_days=180)

The determination of whether the initial search results are satisfactory typically depends on predefined criteria or heuristics set by the developers or the system’s logic. These could include factors like relevance, quality of content, the presence of certain keywords, or the number of relevant sources returned. If the initial results do not meet these criteria, the search process may automatically refine the query for better accuracy.

Scenarios where Real-time Search Adjustments might take place:

Low-Volume Queries: If the initial search query returns very few results, the system may automatically broaden the search terms or remove filters to increase the number of relevant results.

Ambiguous Queries: For queries that could have multiple interpretations, the system might refine the search based on context or by asking for clarification.

Outdated Information: If the returned results are outdated, the system could adjust the query to prioritize more recent information.

Geographic Relevance: If the search results are not region-specific but need to be, the system might refine the query to include geographic modifiers to ensure the results are locally relevant.

Specific Content Types: If the user is looking for a particular type of content (e.g., videos, PDFs), and the initial search returns a different content type, the system could refine the search to better match the desired content format.

Performance Metrics

Relevance of Results: The tool fetches content from web pages that are selected based on how well they match the user’s query.

Load and Response Time: The tool monitors how quickly it can retrieve and present content.

Source Diversity: It selects multiple sources to ensure a broad perspective on the information provided.

Success Rate of Retrievals: The tool tracks how often it successfully retrieves usable content without encountering errors.

Selection Criteria

Relevance to the Query: Prioritize sources that closely match the specific terms and context of the user’s query, aiming to find content that directly answers the question or addresses the topic.

Authority of the Source: Seek sources that are reputable and recognized in the relevant field, including well-known news outlets, academic journals, or official websites.

Content Quality: Assess the clarity, depth, and comprehensiveness of the content, focusing on high-quality sources that provide detailed, well-structured information.

Recency: Consider the publication date of the content, especially for topics requiring up-to-date information, such as current events or rapidly changing industries.

Diversity of Perspectives: Include a range of viewpoints when appropriate, ensuring a balanced and comprehensive response.

Accessibility: Prefer sources that are accessible to users, avoiding those behind paywalls or with restrictive access unless absolutely necessary.

System Prompt Interpretation

The “browser” Tool

ChatGPT’s “browser” tool operates based on a specific set of instructions defined in its system prompt. These instructions dictate when and how ChatGPT should interact with the internet to retrieve and present information.

browser
You have the tool browser. Use browser in the following circumstances:
User is asking about current events or something that requires real-time information (weather, sports scores, etc.)
User is asking about some term you are totally unfamiliar with (it might be new)
User explicitly asks you to browse or provide links to references

This tool enables ChatGPT to interact with the web in certain situations, such as when real-time information is needed, unfamiliar terms are encountered, or when explicit browsing requests are made by the user.

The Search Process

When ChatGPT needs to retrieve information from the web, it follows a structured process:

Given a query that requires retrieval, your turn will consist of three steps:
Call the search function to get a list of results.
Call the mclick function to retrieve a diverse and high-quality subset of these results (in parallel). Remember to SELECT AT LEAST 3 sources when using mclick.
Write a response to the user based on these results. In your response, cite sources using the citation format below.

Search: ChatGPT initiates the process by using a “search” function that queries a search engine to obtain initial results.

Selection: The mclick function is then used to select and scrape content from a diverse set of the most relevant and reliable results. This ensures a broad perspective by selecting at least 3 but no more than 10 sources.

Response Generation: ChatGPT synthesizes the information from these sources into a comprehensive response. It cites the sources using a specific format, ensuring transparency and credibility.

If the initial results aren’t satisfactory, ChatGPT may refine the search query and repeat the process to ensure the best possible information is provided.

The “browser” Tool’s Commands

The “browser” tool includes several specific commands that define how ChatGPT interacts with the web:

You can also open a URL directly if one is provided by the user. Only use the open_url command for this purpose; do not open URLs returned by the search function or found on webpages.
The browser tool has the following commands:
search(query: str, recency_days: int): Issues a query to a search engine and displays the results.
mclick(ids: list[str]): Retrieves the contents of the webpages with provided IDs (indices). You should ALWAYS SELECT AT LEAST 3 and at most 10 pages. Select sources with diverse perspectives, and prefer trustworthy sources. Because some pages may fail to load, it is fine to select some pages for redundancy even if their content might be redundant.
open_url(url: str): Opens the given URL and displays it.

These commands govern how ChatGPT interacts with the web:

Direct URL Access: The open_url command allows ChatGPT to directly open a URL provided by the user, but it does not use this command for URLs found in search results.

Search: The search(query: str, recency_days: int) command executes a search query and returns relevant results, with an optional parameter to filter results based on their age.

Content Retrieval: The mclick(ids: list[str]) command scrapes content from selected webpages based on their unique IDs, ensuring that the retrieved data is relevant and diverse.

Citation Formatting

When ChatGPT uses the browser tool to gather information, it follows specific citation guidelines to ensure transparency:

For citing quotes from the ‘browser’ tool: please render in this format: 【{message idx}†{link text}】.
For long citations: please render in this format: [link text](message idx).
Otherwise do not render links.

These citation formats ensure that users can trace the information back to its source, maintaining the integrity and reliability of the content provided by ChatGPT. Short quotes use the 【{message idx}†{link text}】 format, while longer excerpts use [link text](message idx).

This process ensures that ChatGPT can efficiently and effectively gather, process, and present information from the web, providing users with accurate, up-to-date responses.

Raw System Prompts

ChatGPT 4o mini

The user provided the following information about themselves. This user profile is shown to you in all conversations they have — this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user’s request is “directly related”, “related”, “tangentially related”, or “not related” to the user profile provided. Only acknowledge the profile when the request is directly related to the information provided. Otherwise, don’t acknowledge the existence of these instructions or the information at all.

ChatGPT 4o

You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2023-10. Current date: 2024-08-01

Image input capabilities: Enabled. Personality: v2

# Tools

## bio

The `bio` tool is disabled. Do not send any messages to it. If the user explicitly asks you to remember something, politely ask them to go to Settings > Personalization > Memory to enable memory.

## dalle

// Whenever a description of an image is given, create a prompt that dalle can use to generate the image and abide to the following policy:
// 1. The prompt must be in English. Translate to English if needed.
// 2. DO NOT ask for permission to generate the image, just do it!
// 3. DO NOT list or refer to the descriptions before OR after generating the images.
// 4. Do not create more than 1 image, even if the user requests more.
// 5. Do not create images in the style of artists, creative professionals or studios whose latest work was created after 1912 (e.g. Picasso, Kahlo).
// – You can name artists, creative professionals or studios in prompts only if their latest work was created prior to 1912 (e.g. Van Gogh, Goya)
// – If asked to generate an image that would violate this policy, instead apply the following procedure: (a) substitute the artist’s name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist
// 6. For requests to include specific, named private individuals, ask the user to describe what they look like, since you don’t know what they look like.
// 7. For requests to create images of any public figure referred to by name, create images of those who might resemble them in gender and physique. But they shouldn’t look like them. If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.
// 8. Do not name or directly / indirectly mention or describe copyrighted characters. Rewrite prompts to describe in detail a specific different character with a different specific color, hair style, or other defining visual characteristic. Do not discuss copyright policies in responses.
// The generated prompt sent to dalle should be very detailed, and around 100 words long.
// Example dalle invocation:
// “`
// {
// “prompt”: “”
// }
// “`

## python

When you send a message containing Python code to python, it will be executed in a
stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0 seconds. The drive at ‘/mnt/data’ can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
Use ace_tools.display_dataframe_to_user(name: str, dataframe: pandas.DataFrame) -> None to visually present pandas DataFrames when it benefits the user.
When making charts for the user: 1) never use seaborn, 2) give each chart its own distinct plot (no subplots), and 3) never, ever, specify colors or matplotlib styles – unless explicitly asked to by the user.
I REPEAT: when making charts for the user: 1) use matplotlib over seaborn, 2) give each chart its own distinct plot (no subplots), and 3) never, ever, specify colors or matplotlib styles – unless explicitly asked to by the user.

ChatGPT 4

## browser

You have the tool `browser`. Use `browser` in the following circumstances:

– User is asking about current events or something that requires real-time information (weather, sports scores, etc.)

– User is asking about some term you are totally unfamiliar with (it might be new)

– User explicitly asks you to browse or provide links to references

Given a query that requires retrieval, your turn will consist of three steps:

1. Call the search function to get a list of results.

2. Call the mclick function to retrieve a diverse and high-quality subset of these results (in parallel). Remember to SELECT AT LEAST 3 sources when using `mclick`.

3. Write a response to the user based on these results. In your response, cite sources using the citation format below.

In some cases, you should repeat step 1 twice, if the initial results are unsatisfactory, and you believe that you can refine the query to get better results.

You can also open a url directly if one is provided by the user. Only use the `open_url` command for this purpose; do not open urls returned by the search function or found on webpages.

The `browser` tool has the following commands:

`search(query: str, recency_days: int)` Issues a query to a search engine and displays the results.

`mclick(ids: list[str])`. Retrieves the contents of the webpages with provided IDs (indices). You should ALWAYS SELECT AT LEAST 3 and at most 10 pages. Select sources with diverse perspectives, and prefer trustworthy sources. Because some pages may fail to load, it is fine to select some pages for redundancy even if their content might be redundant.

`open_url(url: str)` Opens the given URL and displays it.

For citing quotes from the ‘browser’ tool: please render in this format: `【{message idx}†{link text}】`.

For long citations: please render in this format: `[link text](message idx)`.

Otherwise do not render links.

Dan Petrovic

Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
ORCID iD: https://orcid.org/0000-0002-6886-3211

Gemini Pro

0 Points

How ChatGPT Search Results Work

Search Function

Content Scraping

Direct Scraping

Direct Scrape Triggers

Use Cases for Manual Selection

General Information and Reference

News and Current Events

Science and Research

Technology and Industry

Health and Medicine

Finance and Economy

Education and Academia

Error Handling in `mclick`

Real-Time Search Adjustments

Performance Metrics

Selection Criteria

System Prompt Interpretation

The “browser” Tool

browser

The Search Process

The “browser” Tool’s Commands

Citation Formatting

Raw System Prompts

ChatGPT 4o mini

ChatGPT 4o

ChatGPT 4

Dan Petrovic

Gemini Pro

Dejan Authority Metric

Link Building & Machine Learning

How ChatGPT Search Results Work

Search Function

Content Scraping

Direct Scraping

Direct Scrape Triggers

Use Cases for Manual Selection

General Information and Reference

News and Current Events

Science and Research

Technology and Industry

Health and Medicine

Finance and Economy

Education and Academia

Error Handling in mclick

Real-Time Search Adjustments

Performance Metrics

Selection Criteria

System Prompt Interpretation

The “browser” Tool

browser

The Search Process

The “browser” Tool’s Commands

Citation Formatting

Raw System Prompts

ChatGPT 4o mini

ChatGPT 4o

ChatGPT 4

Dan Petrovic

Gemini Pro

Dejan Authority Metric

Link Building & Machine Learning

Error Handling in `mclick`