Building a Wikipedia Knowledge Extraction Server with Model Context Protocol

In this cookbook, we'll build a powerful Wikipedia knowledge extraction server using the Model Context Protocol (MCP) and Hyperbrowser. This integration enables AI models to access and extract structured information from Wikipedia through your local machine, dramatically expanding their reference capabilities.

With this setup, you'll be able to give AI models the ability to:

  • Search Wikipedia for relevant articles on any topic
  • Extract complete article content with proper structure
  • Access article edit history to understand how information has evolved
  • Discover new articles through random article exploration

The Model Context Protocol creates a standardized bridge between AI systems and local tools, enabling assistants to work with dynamic web content they couldn't otherwise access. This approach allows your AI to break free from training cutoff limitations and work with the most current information available on Wikipedia.

Prerequisites

Before starting, you'll need:

  1. A Hyperbrowser API key (sign up at hyperbrowser.ai if you don't have one)
  2. The MCP Python package (pip install mcp)
  3. Pydantic for structured data modeling (pip install pydantic)
  4. Python 3.9+ installed

Store your API key in a .env file or set it as an environment variable as needed for the MCP client.
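
If you keep the key in a .env file, you can load it into the environment before the server starts. A minimal sketch, assuming the python-dotenv package (pip install python-dotenv), which is an extra dependency not listed above:

# Optional: load HYPERBROWSER_API_KEY from a local .env file.
# Assumes python-dotenv is installed; skip this if you export the variable
# in your shell or pass it through the MCP client configuration instead.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

if not os.getenv("HYPERBROWSER_API_KEY"):
    raise RuntimeError("HYPERBROWSER_API_KEY is not set")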

For Claude Desktop users, you'll need to modify the claude_desktop_config.json like so:

{
  "mcpServers": {
    "hyperbrowser-wiki": {
      "command": "<PATH TO PYTHON>",
      "args": ["<PATH TO MAIN.PY>/main.py"],
      "env": {
        "HYPERBROWSER_API_KEY": "<HYPERBROWSER_API_KEY>"
      }
    }
  }
}

Step 1: Import Libraries and Set Up Environment

We start by importing the necessary packages for our Wikipedia extraction server. The key components include:

  • Hyperbrowser: For automated web extraction and parsing
  • FastMCP: The Model Context Protocol server implementation
  • Pydantic: For creating strongly-typed data models that structure our Wikipedia content
  • urllib.parse: For URL encoding article titles and search queries

These libraries work together to create a robust Wikipedia knowledge system that AI models can discover and use through the MCP protocol.

import os
import json
import urllib.parse
from typing import List, Literal, Optional
from hyperbrowser import Hyperbrowser
from hyperbrowser.models.extract import StartExtractJobParams
from hyperbrowser.models.scrape import StartScrapeJobParams, ScrapeOptions
from pydantic import BaseModel
from mcp.server.fastmcp import FastMCP

The required packages can be installed with:

pip install mcp hyperbrowser pydantic

Step 2: Initialize the MCP Server

Now we initialize our Model Context Protocol server with a meaningful identifier. This identifier is what AI models will use to discover and connect to our Wikipedia tools.

The MCP server is the bridge that exposes our Wikipedia extraction capabilities to AI models using a standardized interface. This standardization is what makes it possible for any MCP-compatible AI to discover and use our tools without custom training or integration work.

mcp = FastMCP("hyperbrowser-wiki")

Step 3: Define Data Models for Wikipedia Content

Before implementing our extraction tools, we need to define structured data models using Pydantic. These models are crucial for several reasons:

  1. They provide explicit schemas that guide Hyperbrowser's extraction process
  2. They ensure all Wikipedia data is consistently structured and validated
  3. They define clear interfaces that AI models can rely on when using our tools

Our model hierarchy includes:

  • WikipediaArticle: For complete article content
  • WikipediaSearchResult: For individual search results
  • WikipediaSearchResultList: A collection of search results
  • WikipediaContent: A union type that can represent either search results or a complete article
  • WikipediaEdit: For tracking changes to articles
  • WikipediaEditHistory: A collection of edits for an article

This rich type system enables precise, structured knowledge extraction from Wikipedia's complex content.

# Define Pydantic models for Wikipedia data
class WikipediaArticle(BaseModel):
    title: str
    summary: str
    content: str
    url: str


class WikipediaSearchResult(BaseModel):
    title: str
    snippet: str
    url: str


class WikipediaSearchResultList(BaseModel):
    results: List[WikipediaSearchResult]


# Union model for Wikipedia search and article
class WikipediaContent(BaseModel):
    type: Literal["search", "article"]  # "search" or "article"
    search_results: Optional[List[WikipediaSearchResult]] = None
    article: Optional[WikipediaArticle] = None


class WikipediaEdit(BaseModel):
    editor: str
    timestamp: str
    summary: str
    size_change: Optional[int]


class WikipediaEditHistory(BaseModel):
    title: str
    edits: List[WikipediaEdit]
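
As a quick sanity check of these models, the snippet below round-trips a hand-written WikipediaArticle through JSON with Pydantic; the values are made up purely for illustration, not extracted from Wikipedia:

# Purely illustrative round trip; the field values are invented.
article = WikipediaArticle(
    title="Ada Lovelace",
    summary="English mathematician and writer.",
    content="Ada Lovelace was an English mathematician...",
    url="https://en.wikipedia.org/wiki/Ada_Lovelace",
)

as_json = article.model_dump_json()  # the serialized form our tools return
restored = WikipediaArticle.model_validate_json(as_json)
assert restored == article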

Step 4: Create the Wikipedia Search Tool

Our first MCP tool provides a powerful search interface to Wikipedia. This tool:

  1. Takes a search query and constructs a properly formatted Wikipedia search URL
  2. Handles two distinct scenarios automatically:
     • When the search returns multiple results (returning structured search results)
     • When the search matches a specific article (returning the complete article)
  3. Uses a custom system prompt to guide Hyperbrowser's extraction process

This intelligent handling means the AI doesn't need to make separate calls for searching and retrieving articles - the tool automatically determines the appropriate behavior based on Wikipedia's response.

@mcp.tool()
def search_wikipedia(query: str) -> str:
    """Search Wikipedia for articles matching the query"""
    hb = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))
    search_query = urllib.parse.quote_plus(query)
    search_url = f"https://en.wikipedia.org/w/index.php?search={search_query}"
    resp = hb.extract.start_and_wait(
        StartExtractJobParams(
            urls=[search_url],
            schema=WikipediaContent,
            system_prompt="""Your task is to extract information from Wikipedia pages. There are two possible scenarios:
            1. Search Results Page:
               - Extract all search results including title, snippet, and URL
               - Include only actual article results (ignore special pages, categories etc.)
               - Limit to the first page of results
            2. Direct Article Page (when search exactly matches an article title):
               - Extract the full article content including title, introduction, sections, and references
               - Do not extract any search results in this case
               - Ensure proper handling of article redirects
            Set the 'type' field to either "search" or "article" accordingly.
            Return structured data matching the WikipediaContent schema.""",
        )
    )
    if resp.data:
        return WikipediaContent.model_validate(resp.data).model_dump_json()
    else:
        raise ValueError("Could not get search results from Wikipedia.")
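
On the consuming side, a caller can branch on the type field to tell the two scenarios apart. A minimal sketch, where payload stands for the JSON string returned by search_wikipedia:

# 'payload' is the JSON string produced by search_wikipedia(); parsing it back
# into WikipediaContent shows how the two scenarios are distinguished.
def handle_search_payload(payload: str) -> None:
    result = WikipediaContent.model_validate_json(payload)
    if result.type == "article" and result.article is not None:
        # The query matched an article title exactly, so the full article came back.
        print(result.article.title)
    else:
        # Otherwise we received a page of search results.
        for hit in result.search_results or []:
            print(f"{hit.title} -> {hit.url}")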

Step 5: Create the Raw Wikipedia Content Tool

Next, we implement a tool for extracting raw content from Wikipedia articles. Unlike our search tool, this function:

  1. Takes a specific article title rather than a search query
  2. Uses Hyperbrowser's scrape functionality (rather than extract) to obtain the complete, unstructured content
  3. Returns the content in multiple formats: Markdown plus the article's links

This approach gives AI models access to the complete, unprocessed Wikipedia article content when they need the full context rather than a structured extraction.

@mcp.tool()
def get_raw_wikipedia(title: str) -> str:
    """Get the raw content of a Wikipedia article by title"""
    hb = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))
    formatted_title = urllib.parse.quote(title.replace(" ", "_"))
    article_url = f"https://en.wikipedia.org/wiki/{formatted_title}"
    resp = hb.scrape.start_and_wait(
        StartScrapeJobParams(
            url=article_url, scrape_options=ScrapeOptions(formats=["markdown", "links"])
        )
    )
    if resp.data:
        # model_dump_json() already returns a JSON string, so no extra json.dumps is needed
        return resp.data.model_dump_json()
    else:
        raise ValueError(f"Could not get raw content for article '{title}' from Wikipedia.")

Step 6: Create the Wikipedia Article Tool

This tool provides a more structured approach to retrieving complete Wikipedia articles. It differs from the raw content tool by:

  1. Using Hyperbrowser's extract functionality to create a structured representation
  2. Conforming to our WikipediaArticle model with clearly defined fields
  3. Providing a cleaner, more organized representation of the article's content

This structured approach makes it easier for AI models to reason about and reference specific parts of Wikipedia articles, while maintaining the full informational content.

@mcp.tool()
def get_wikipedia_article(title: str) -> str:
    """Get the full content of a Wikipedia article by title"""
    hb = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))
    formatted_title = urllib.parse.quote(title.replace(" ", "_"))
    article_url = f"https://en.wikipedia.org/wiki/{formatted_title}"
    resp = hb.extract.start_and_wait(
        StartExtractJobParams(urls=[article_url], schema=WikipediaArticle)
    )
    if resp.data:
        return WikipediaArticle.model_validate(resp.data).model_dump_json()
    else:
        raise ValueError(f"Could not get article '{title}' from Wikipedia.")

Step 7: Create the Edit History Tool

Understanding how Wikipedia content evolves is crucial for assessing its reliability. This tool enables AI models to access an article's edit history by:

  1. Converting the article title to a properly formatted history URL
  2. Using a specialized system prompt to guide the extraction of edit information
  3. Structuring the edit history according to our WikipediaEditHistory model

This capability allows AI models to assess how recently an article has been updated, identify controversial sections (those with frequent edits), and understand the evolution of knowledge on a topic over time.

@mcp.tool()
def get_wikipedia_edit_summary(title: str) -> str:
    """Get the edit history of a Wikipedia article by title"""
    hb = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))
    formatted_title = urllib.parse.quote(title.replace(" ", "_"))
    history_url = (
        f"https://en.wikipedia.org/w/index.php?title={formatted_title}&action=history"
    )
    resp = hb.extract.start_and_wait(
        StartExtractJobParams(
            urls=[history_url],
            schema=WikipediaEditHistory,
            system_prompt="""Extract the edit history from this Wikipedia page. Include:
            - The timestamp of each edit
            - The editor's username or IP
            - The edit summary/comment
            - The size change (+/- bytes)
            """,
        )
    )
    if resp.data:
        return WikipediaEditHistory.model_validate(resp.data).model_dump_json()
    else:
        raise ValueError(
            f"Could not get edit history for article '{title}' from Wikipedia."
        )

Step 8: Create the Random Article Tool

Serendipitous discovery is a powerful way to expand knowledge. This tool enables AI models to explore Wikipedia randomly by:

  1. Accessing Wikipedia's Special:Random page which redirects to a random article
  2. Extracting the complete content of whatever article is returned
  3. Structuring it according to our WikipediaArticle model

This capability can be particularly valuable for exploration tasks, generating diverse examples, or simply expanding an AI's knowledge in unexpected directions.

@mcp.tool()
def get_random_article() -> str:
    """Get a random Wikipedia article"""
    hb = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))
    random_url = "https://en.wikipedia.org/wiki/Special:Random"
    resp = hb.extract.start_and_wait(
        StartExtractJobParams(urls=[random_url], schema=WikipediaArticle)
    )
    if resp.data:
        return WikipediaArticle.model_validate(resp.data).model_dump_json()
    else:
        raise ValueError("Could not get a random article from Wikipedia.")

Step 9: Running the MCP Server

Finally, we'll launch our MCP server to make our Wikipedia extraction tools available to AI models. The server uses stdio (standard input/output) as its transport mechanism, making it compatible with a wide range of AI clients including Claude Desktop, Cline, Cursor, and other MCP-compatible systems.

When an AI model connects to this server, it will automatically discover all five of our Wikipedia tools along with their documentation, parameter types, and return types - all through the standardized MCP protocol.

if __name__ == "__main__":
    # Initialize and run the server
    mcp.run(transport="stdio")
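
To exercise the server outside of a desktop client, the MCP Python SDK also ships a stdio client that can spawn the server process, list its tools, and call them. A minimal sketch, assuming this file is saved as main.py and HYPERBROWSER_API_KEY is set in your environment; adjust the command and paths to your setup:

# Quick local test: spawn main.py as an MCP server over stdio and call a tool.
# Assumes the server file is named main.py and the API key is in the environment.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def try_server() -> None:
    params = StdioServerParameters(
        command="python",
        args=["main.py"],
        env={"HYPERBROWSER_API_KEY": os.environ["HYPERBROWSER_API_KEY"]},
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])
            result = await session.call_tool("search_wikipedia", {"query": "Alan Turing"})
            print(result.content)


asyncio.run(try_server())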

Conclusion

In this cookbook, we've built a powerful Wikipedia knowledge extraction system using the Model Context Protocol and Hyperbrowser. This combination enables AI models to access Wikipedia's vast repository of information in ways that would otherwise be impossible due to training cutoffs or API limitations.

By leveraging MCP, we've created a standardized interface that allows any compatible AI to:

  • Search Wikipedia and retrieve relevant articles
  • Extract complete article content in both structured and raw formats
  • Access article edit histories to evaluate information credibility
  • Explore new topics through random article discovery

All without requiring custom training or hardcoded integrations for each specific task.

Next Steps

To take this Wikipedia extraction system further, you might consider:

  • Adding support for multiple languages beyond English Wikipedia
  • Implementing category browsing functionality
  • Creating tools for extracting structured data from infoboxes

The MCP protocol opens up possibilities far beyond Wikipedia - any web-based or local data source can be made available to AI models using this same pattern, dramatically expanding their knowledge capabilities.