Wikipedia has covered pretty much every topic your mind can conceive. Another thing Wikipedia is good at is internally linking from one page to another. Wouldn't it be great to have this corpus of knowledge as a reference point? While doing some research, I found out that there happens to be a Wikipedia API Python library with various functions that can help you pull and visualize this data. 🤯

There is a lot that can be achieved, but in this post I will cover how you can use the Wikipedia API Python library to visualize the topical map for a given Wikipedia page: how it emerges from that page and connects to various other topics. For the visualization itself, we will be leveraging a Plotly Sankey chart. Here is the script, without further ado!

## **Step 1 - Install the necessary libraries**

```
!pip install wikipedia-api plotly nltk
```

We are installing three libraries with this line of code: wikipedia-api, Plotly & NLTK. NLTK, an NLP Python library, handles the data-cleaning part: it supplies the stopword list (and lemmatizer) we use to improve the output quality.

## **Step 2 - Imports & Functions**

```python
import wikipediaapi
import plotly.graph_objects as go
from urllib.parse import unquote
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)


def preprocess_text(text):
    """Lowercase, tokenize, drop stopwords/non-alphabetic tokens, and lemmatize."""
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return set(tokens)


def calculate_relevance(source_title, target_title):
    """Score two page titles by their share of overlapping (cleaned) tokens."""
    source_tokens = preprocess_text(source_title)
    target_tokens = preprocess_text(target_title)
    if not source_tokens or not target_tokens:
        return 0.0  # avoid division by zero when a title is all stopwords/non-alphabetic
    common_tokens = source_tokens.intersection(target_tokens)
    return len(common_tokens) / max(len(source_tokens), len(target_tokens))


def is_valid_page(title):
    """Skip talk, user, category, template, help and file pages."""
    excluded_prefixes = [
        "Wikipedia talk:", "Talk:", "User:", "User talk:",
        "Category:", "Template:", "Help:", "File:"
    ]
    return not any(title.startswith(prefix) for prefix in excluded_prefixes)


def get_page_links(page_url, depth=5, max_links=2, relevance_threshold=0.1):
    """Recursively collect relevant links, starting from the given Wikipedia URL."""
    title = unquote(page_url.split("/")[-1].replace("_", " "))
    wiki = wikipediaapi.Wikipedia('MyProjectName (merlin@example.com)', 'en')

    def fetch_links(page_title, current_depth):
        page = wiki.page(page_title)
        if not page.exists():
            return {}
        links = list(page.links.keys())
        relevant_links = [
            link for link in links
            if calculate_relevance(page_title, link) >= relevance_threshold
            and is_valid_page(link)
        ]
        relevant_links = relevant_links[:max_links]
        result = {page_title: relevant_links}
        if current_depth < depth:
            for link in relevant_links:
                result.update(fetch_links(link, current_depth + 1))
        return result

    return fetch_links(title, 0)


def create_sankey_data(links_dict):
    """Turn the {page: [links]} mapping into node labels and source/target/value lists."""
    nodes = list(links_dict.keys())
    for sublinks in links_dict.values():
        nodes.extend(sublinks)
    nodes = list(dict.fromkeys(nodes))  # Remove duplicates while preserving order
    node_indices = {node: i for i, node in enumerate(nodes)}

    source = []
    target = []
    value = []
    for page, sublinks in links_dict.items():
        for sublink in sublinks:
            source.append(node_indices[page])
            target.append(node_indices[sublink])
            value.append(1)

    return nodes, source, target, value


def create_sankey_chart(nodes, source, target, value):
    """Build the Plotly Sankey figure from the prepared node and link data."""
    fig = go.Figure(data=[go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=nodes,
            color="blue"
        ),
        link=dict(
            source=source,
            target=target,
            value=value
        ))])
    fig.update_layout(
        title_text="Semantically Filtered Wikipedia Page Links Sankey Diagram (Excluding Talk Pages)",
        font_size=10
    )
    return fig


def main(page_url):
    links_dict = get_page_links(page_url)
    nodes, source, target, value = create_sankey_data(links_dict)
    fig = create_sankey_chart(nodes, source, target, value)
    fig.show()


if __name__ == "__main__":
    page_url = "https://en.wikipedia.org/wiki/Insurance_fraud"
    main(page_url)
```
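Since the whole map hinges on `calculate_relevance`, here is a quick, optional sanity check you can run in the same session once the functions above are defined. This is just an illustration I'm adding here, not part of the original script, and the titles are arbitrary examples:

```python
# Illustrative only -- calculate_relevance comes from the script above.
# "Insurance fraud"  -> {"insurance", "fraud"}
# "Insurance policy" -> {"insurance", "policy"}
# One shared token out of a maximum set size of two gives 0.5,
# comfortably above the default relevance_threshold of 0.1.
print(calculate_relevance("Insurance fraud", "Insurance policy"))  # 0.5
print(calculate_relevance("Insurance fraud", "Venice"))            # 0.0 -> link gets filtered out
```

Any candidate link that scores below `relevance_threshold` is dropped before the crawler follows it.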
In the main script above, we import the libraries, define the helper functions, and leverage NLTK for stopword removal and relevance scoring. NLTK alone wasn't enough, though, which is why we also exclude pages whose titles start with any of the following prefixes: "Wikipedia talk:", "Talk:", "User:", "User talk:", "Category:", "Template:", "Help:", "File:".

With this line:

`def get_page_links(page_url, depth=5, max_links=2, relevance_threshold=0.1):`

we specify how deep the crawl should go and the maximum number of links to keep for each page. Keeping these limits small, together with the relevance threshold, declutters the visualization; otherwise it becomes way too cluttered. (A small parameter-tweaking sketch follows at the end of this post.)

Here is the visualization we get, which resembles a topical map reference point. You can drag the nodes around, and hovering over a node shows its relationship with the other nodes, including its incoming & outgoing link flow. The entire script takes a few minutes to execute, and you very easily get a bird's-eye view of the topical map.

**SEO Use Case:** Take a core topic page that you have been struggling to rank, find its Wikipedia page & see how the topical map emerges from that page. This will help you spot any topical gaps you need to address.
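And here is the parameter-tweaking sketch mentioned above. It is a minimal variation, assuming the functions from the script are already defined in your session; the values are purely illustrative rather than recommendations, and Plotly's `write_html()` saves the interactive chart as a standalone file you can share:

```python
# Illustrative variation -- the functions come from the script above.
links_dict = get_page_links(
    "https://en.wikipedia.org/wiki/Insurance_fraud",
    depth=3,                  # follow links three levels deep instead of five
    max_links=3,              # keep up to three relevant links per page
    relevance_threshold=0.2,  # stricter filter -> cleaner diagram
)
nodes, source, target, value = create_sankey_data(links_dict)
fig = create_sankey_chart(nodes, source, target, value)
fig.write_html("topical_map.html")  # standalone, interactive HTML file
```

Bear in mind that raising `depth` or `max_links` grows the crawl quickly, so the run time goes up accordingly.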