Visualize Topical Maps Using Python and Wikipedia API

Wikipedia has covered pretty much every topic your mind can conceive. Another thing Wikipedia is good at is internally linking from one page to another.

Wouldn’t it be great to have this corpus of knowledge as a reference point?

I was doing some research and found that there is a Wikipedia API Python library with various functions that let you pull Wikipedia's data and use it as a reference point. 🤯

There is a lot that can be achieved with it. But in this post, I will cover how you can use the Wikipedia API Python library to visualize the topical map for a given Wikipedia page, and see how it emerges from there and connects to various topics.

In this Python script, we will be leveraging Plotly's Sankey chart for the visualization.

Here is the script without further ado!

Step 1 – Install the necessary libraries

!pip install wikipedia-api plotly nltk

We are installing three libraries with this line of code: wikipedia-api, Plotly & NLTK.

NLTK, an NLP Python library, helps us with the data cleaning part: it supplies the stopword list we use to filter tokens and improve the output quality.
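To make the cleaning step concrete, here is a minimal standalone sketch of what the preprocessing in Step 2 does to a page title (the title below is just an illustrative example):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK data used below
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by word_tokenize on newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

title = "History of insurance in the United States"  # illustrative title
tokens = word_tokenize(title.lower())
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Keep only alphabetic, non-stopword tokens, then lemmatize them
cleaned = [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]
print(cleaned)  # e.g. ['history', 'insurance', 'united', 'state']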

Step 2 – Imports & Functions

					import wikipediaapi
import plotly.graph_objects as go
from urllib.parse import unquote
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by word_tokenize on newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return set(tokens)

def calculate_relevance(source_title, target_title):
    # Fraction of shared title tokens, used to decide whether a link stays in the map
    source_tokens = preprocess_text(source_title)
    target_tokens = preprocess_text(target_title)
    if not source_tokens or not target_tokens:
        return 0.0  # avoid division by zero when a title is all stopwords or non-alphabetic
    common_tokens = source_tokens.intersection(target_tokens)
    return len(common_tokens) / max(len(source_tokens), len(target_tokens))

def is_valid_page(title):
    excluded_prefixes = [
        "Wikipedia talk:",
        "Talk:",
        "User:",
        "User talk:",
        "Category:",
        "Template:",
        "Help:",
        "File:"
    ]
    return not any(title.startswith(prefix) for prefix in excluded_prefixes)

def get_page_links(page_url, depth=5, max_links=2, relevance_threshold=0.1):
    # Turn ".../wiki/Insurance_fraud" into "Insurance fraud"
    title = unquote(page_url.split("/")[-1].replace("_", " "))
    # wikipedia-api expects a descriptive user agent string plus the language code
    wiki = wikipediaapi.Wikipedia('MyProjectName (merlin@example.com)', 'en')
    
    def fetch_links(page_title, current_depth):
        page = wiki.page(page_title)
        if not page.exists():
            return {}
        
        links = list(page.links.keys())
        relevant_links = [
            link for link in links 
            if calculate_relevance(page_title, link) >= relevance_threshold
            and is_valid_page(link)
        ]
        relevant_links = relevant_links[:max_links]
        
        result = {page_title: relevant_links}
        
        if current_depth < depth:
            for link in relevant_links:
                result.update(fetch_links(link, current_depth + 1))
        
        return result
    
    return fetch_links(title, 0)

def create_sankey_data(links_dict):
    nodes = list(links_dict.keys())
    for sublinks in links_dict.values():
        nodes.extend(sublinks)
    nodes = list(dict.fromkeys(nodes))  # Remove duplicates while preserving order
    
    node_indices = {node: i for i, node in enumerate(nodes)}
    
    source = []
    target = []
    value = []
    
    for page, sublinks in links_dict.items():
        for sublink in sublinks:
            source.append(node_indices[page])
            target.append(node_indices[sublink])
            value.append(1)
    
    return nodes, source, target, value

def create_sankey_chart(nodes, source, target, value):
    fig = go.Figure(data=[go.Sankey(
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(color = "black", width = 0.5),
          label = nodes,
          color = "blue"
        ),
        link = dict(
          source = source,
          target = target,
          value = value
      ))])

    fig.update_layout(title_text="Semantically Filtered Wikipedia Page Links Sankey Diagram (Excluding Talk Pages)", font_size=10)
    return fig

def main(page_url):
    links_dict = get_page_links(page_url)
    nodes, source, target, value = create_sankey_data(links_dict)
    fig = create_sankey_chart(nodes, source, target, value)
    fig.show()

if __name__ == "__main__":
    page_url = "https://en.wikipedia.org/wiki/Insurance_fraud"
    main(page_url)

In the above code block, we imported the libraries, defined our helper functions, and leveraged NLTK for stopword removal and relevance scoring.

However, NLTK alone wasn't enough, which is why we also excluded pages whose titles start with the following namespace prefixes (there is a quick sanity check right after the list):

“Wikipedia talk:”,
“Talk:”,
“User:”,
“User talk:”,
“Category:”,
“Template:”,
“Help:”,
“File:”
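To sanity-check that filter, you can run the is_valid_page helper from Step 2 against a few sample titles (the titles below are just examples):

# Quick check of the namespace filter defined in Step 2
for t in ["Insurance fraud", "Talk:Insurance fraud", "Category:Fraud", "Help:Contents"]:
    print(t, "->", is_valid_page(t))
# Only "Insurance fraud" should come back True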
 
With this line

def get_page_links(page_url, depth=5, max_links=2, relevance_threshold=0.1):
 
We are specifying how deep the crawl should go (depth), the maximum number of links to process for each page (max_links), and the minimum title overlap a link needs in order to be kept (relevance_threshold).
 
Keeping max_links low and enforcing a relevance threshold declutters the visualization; without them it becomes way too cluttered.
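If the chart still feels too busy (or too sparse), these three parameters are the knobs to turn. A quick sketch, with purely illustrative values:

# A wider but shallower map: more links per page, fewer hops, stricter relevance
links_dict = get_page_links(
    "https://en.wikipedia.org/wiki/Insurance_fraud",
    depth=3,
    max_links=4,
    relevance_threshold=0.2,
)
nodes, source, target, value = create_sankey_data(links_dict)
create_sankey_chart(nodes, source, target, value).show()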
 
Here is the visualization we get, which works as a topical map reference point.

[Image: Wikipedia API topical map Sankey chart visualization]

You can drag the nodes around and hover over them to see their relationships with other nodes; hovering also displays the incoming and outgoing link flow.

The entire script takes a few minutes to execute, and you can then observe the topical map from a bird's-eye view.
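Since the crawl takes a while, it can be worth saving the interactive chart as a standalone HTML file so you can reopen or share it without re-running the script. A small optional tweak to main() using Plotly's write_html (the filename is arbitrary):

def main(page_url):
    links_dict = get_page_links(page_url)
    nodes, source, target, value = create_sankey_data(links_dict)
    fig = create_sankey_chart(nodes, source, target, value)
    fig.write_html("wikipedia_topical_map.html")  # open this file later in any browser
    fig.show()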

SEO Use Case:

Take a core topic page you have been struggling to rank, find its Wikipedia counterpart, and see how the topical map emerges from that page. This will help you spot any topical map gaps you need to address.
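In practice, that just means pointing the script at the Wikipedia page for your core topic; the URL below is only a placeholder:

# Replace with the Wikipedia page that matches your core topic
main("https://en.wikipedia.org/wiki/Content_marketing")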
