Using Notepad++ to Pull a List of URLs from an HTML Document
Recently I needed a way to take all the URLs from an HTML document and paste them into a spreadsheet. There were far too many to do by hand; I needed some sort of automated solution. After some Googling, I found what I was looking for on superuser.com: Paste your text into Notepad++ and use bookmarking to remove all but the strings you want to keep. I took those steps and adapted them for my purposes:
Copy and paste the HTML markup into Notepad++.
Select Search > Mark..., and then check Bookmark line.
Under "Search Mode", make sure Regular expression is selected.
For "Find what:", enter:
^.*(?:href="http).*$
Select Mark All.
Select Close. All the lines containing href links will now be highlighted.
Select Search > Bookmark > Remove Unmarked Lines. This will remove all but the highlighted (bookmarked) lines.
You'll still have extraneous markup to remove. To begin removing all text except for the URLs, press Ctrl-H.
Make sure Regular expression is selected.
For "Find what", enter:
^.*href="(.*?)".*?>.*?$
For "Replace with", enter:
$1
Select Replace All.
You should now have a clean list of URLs.
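For anyone who would rather script this than click through the dialogs, the same two passes can be sketched in Python. This is only a rough equivalent of the Notepad++ steps, not a proper HTML parser: it reuses the two regular expressions from above, and the names `extract_urls` and `SAMPLE_HTML` are made up for illustration. The only change is that the replacement is written `\1`, the Python spelling of Notepad++'s `$1`.

```python
import re

# Pass 1 (the "Mark" step): keep only lines that contain href="http
MARK_PATTERN = re.compile(r'^.*(?:href="http).*$')

# Pass 2 (the "Replace" step): reduce each kept line to the href value
EXTRACT_PATTERN = re.compile(r'^.*href="(.*?)".*?>.*?$')


def extract_urls(html_text):
    """Return one URL per line that contains an http(s) href."""
    urls = []
    for line in html_text.splitlines():
        if MARK_PATTERN.match(line):
            # Replace the whole line with the captured group
            urls.append(EXTRACT_PATTERN.sub(r"\1", line))
    return urls


# Hypothetical sample markup, one link per line
SAMPLE_HTML = """<ul>
<li><a href="http://example.com/a">A</a></li>
<li>No link on this line</li>
<li><a href="https://example.org/b" class="ext">B</a></li>
</ul>"""

for url in extract_urls(SAMPLE_HTML):
    print(url)
```

As with the Notepad++ version, this assumes each link sits on its own line. Because the leading `.*` is greedy, a line holding several links yields only the last `href` on that line.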