Scrape Menu Links with a Simple Python Script

In this blog post, I will share a Google Colab Python notebook that maps every link in a site's header navigation menu to the anchor text it uses.

Why even create such a Python Script?

Recently, I needed to understand which links were present in a site's header navigation menu, and that menu contained a huge number of links.

That is when it occurred to me that I can expedite my workflow by utilising Python.

Here is the step-by-step process to achieve this.

Step 1: Installations

				
!pip install beautifulsoup4 pandas lxml
				
			

Step 2: Paste Header HTML Code that contains the Header Menu

				
# Paste the HTML snippet inside the triple quotes
html_snippet = """
entire code
"""
				
			

Step 3: Import the libraries and specify the hostname in case the site uses relative links instead of absolute ones

				
from bs4 import BeautifulSoup
import pandas as pd

# Base domain to prepend to relative URLs
base_url = "https://www.ebay.com"

# Parse HTML
soup = BeautifulSoup(html_snippet, "lxml")

# Find all anchor tags
anchors = soup.find_all("a")

# Extract href and anchor text
data = []
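for a in anchors:
    href = a.get("href")
    # Skip anchors without an href; calling .startswith on None raises AttributeError
    if not href:
        continue
    text = a.get_text(strip=True)
    # Prepend the base domain to root-relative links only
    full_url = base_url + href if href.startswith("/") else href
    data.append({"URL": full_url, "Anchor Text": text})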

# Create DataFrame
df = pd.DataFrame(data)

# Display DataFrame
df
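
The concatenation above only covers hrefs that begin with a single slash. Protocol-relative links such as `//pages.ebay.com/help`, or relative paths without a leading slash, would come out wrong. Here is a minimal sketch of a more general approach, assuming the standard library's `urllib.parse.urljoin` suits your case. The helper name `resolve_href` is hypothetical, and skipping `#`, `javascript:`, and `mailto:` links is my assumption about what you want in the output:

```python
from urllib.parse import urljoin

# Base domain, same value as in the step above
base_url = "https://www.ebay.com"

def resolve_href(href, base=base_url):
    """Return an absolute URL for href, or None if it is not a navigable link.

    Hypothetical helper, not part of the original notebook.
    """
    if not href:
        return None
    href = href.strip()
    # Assumption: fragment-only, javascript:, and mailto: links are not wanted
    if not href or href.startswith(("#", "javascript:", "mailto:")):
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(base, href)
```

Inside the loop, `full_url = resolve_href(a.get("href"))` could replace the concatenation, with a `continue` whenever it returns `None`.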
				
			

Step 4: Preview the DataFrame

[Screenshot: header links DataFrame]
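
While previewing, you may notice the same link listed more than once, since header markup sometimes repeats menu items (for example, in separate desktop and mobile menus). Here is a minimal sketch for removing repeats, assuming rows shaped like the `data` list from Step 3. The helper name `dedupe_links` and the sample rows are hypothetical:

```python
def dedupe_links(rows):
    """Drop repeated (URL, anchor text) pairs, keeping first-seen order.

    Hypothetical helper; rows is a list of dicts shaped like `data` in Step 3.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["URL"], row["Anchor Text"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Illustrative sample rows, not real scraped output
sample = [
    {"URL": "https://www.ebay.com/deals", "Anchor Text": "Deals"},
    {"URL": "https://www.ebay.com/help", "Anchor Text": "Help"},
    {"URL": "https://www.ebay.com/deals", "Anchor Text": "Deals"},
]
unique_rows = dedupe_links(sample)
```

If you would rather stay in pandas, `df.drop_duplicates()` has a similar effect on the DataFrame.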

Step 5: Download the results as a CSV

				
from google.colab import files

# Save the DataFrame to a CSV file
csv_filename = "anchor_links.csv"
df.to_csv(csv_filename, index=False)

# Trigger download
files.download(csv_filename)
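
The `google.colab` module is only available inside Colab. `df.to_csv` works anywhere, so only the `files.download` call is Colab-specific. Here is a minimal sketch for running the script elsewhere without pandas, assuming the same two columns and using only the standard library's `csv` module. The function name `write_links_csv` and the sample row are hypothetical:

```python
import csv

def write_links_csv(rows, path="anchor_links.csv"):
    """Write rows (dicts with "URL" and "Anchor Text" keys) to a CSV file.

    Hypothetical helper, not part of the original notebook.
    """
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["URL", "Anchor Text"])
        writer.writeheader()
        writer.writerows(rows)
    return path

# Illustrative sample row, not real scraped output
sample_rows = [{"URL": "https://www.ebay.com/deals", "Anchor Text": "Deals"}]
saved_path = write_links_csv(sample_rows)
```

The file lands in the current working directory, so no download step is needed when running locally.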
				
			
