Using Notepad++ to Pull a List of URLs from an HTML Document
Recently I needed a way to take all the URLs from an HTML document and paste them into a spreadsheet. There were far too many to do by hand; I needed some sort of automated solution. After some Googling, I found what I was looking for on superuser.com: Paste your text into Notepad++ and use bookmarking to remove all but the strings you want to keep. I took those steps and adapted them for my purposes:
Copy and paste the HTML markup into Notepad++.
Select Search > Mark..., and then check "Bookmark line".
Under "Search Mode", make sure "Regular expression" is selected.
For "Find what:", enter:
^.*(?:href="http).*$
Select "Mark All".
Select "Close". All the lines containing href links will now be highlighted.
Select Search > Bookmark > Remove Unmarked Lines. This will remove all but the highlighted (bookmarked) lines.
You'll still have extraneous markup to remove. To begin removing all text except for the URLs, press Ctrl-H.
Make sure "Regular expression" is selected.
For "Find what", enter:
^.*href="(.*?)".*?>.*?$
For "Replace with", enter:
$1
Select "Replace All".
You should now have a clean list of URLs.
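If you'd rather script it, the same two passes can be approximated in a few lines of Python. This is only a rough sketch and not part of the original Notepad++ recipe. The function name extract_urls and the sample markup are made up for illustration, and it assumes the same thing the steps above do: each link sits on its own line.

```python
import re

# Pass 1 (the "Mark" step): a line qualifies only if it contains href="http
MARK_RE = re.compile(r'^.*(?:href="http).*$')
# Pass 2 (the "Replace" step): group 1 captures the text between the quotes
EXTRACT_RE = re.compile(r'^.*href="(.*?)".*?>.*?$')


def extract_urls(html: str) -> list[str]:
    """Return one URL per qualifying line (hypothetical helper)."""
    urls = []
    for line in html.splitlines():
        # Skip lines that would not have been bookmarked
        if not MARK_RE.match(line):
            continue
        match = EXTRACT_RE.match(line)
        if match:
            # Equivalent of replacing the whole line with $1
            urls.append(match.group(1))
    return urls


if __name__ == "__main__":
    # Made-up sample markup, one link per line
    sample = "\n".join([
        "<ul>",
        '<li><a href="http://example.com/a">A</a></li>',
        '<li><a href="https://example.com/b" class="x">B</a></li>',
        '<li><a href="/relative">C</a></li>',
        "</ul>",
    ])
    for url in extract_urls(sample):
        print(url)
```

Run against the sample, this prints the two absolute URLs and skips the relative link, because the first pattern only keeps lines containing href="http. As with the Notepad++ steps, a line holding several links yields only one URL, so markup with multiple links per line would need a different approach, such as a real HTML parser.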