Victoria's Notes

Using Notepad++ to Pull a List of URLs from an HTML Document

Recently I needed a way to take all the URLs from an HTML document and paste them into a spreadsheet. There were far too many to do by hand; I needed some sort of automated solution. After some Googling, I found what I was looking for on superuser.com: Paste your text into Notepad++ and use bookmarking to remove all but the strings you want to keep. I took those steps and adapted them for my purposes:

  1. Copy and paste the HTML markup into Notepad++.

  2. Select Search > Mark..., and then check Bookmark line.

  3. Under "Search Mode", make sure Regular expression is selected.

  4. For "Find what", enter: ^.*(?:href="http).*$

  5. Select Mark All.

  6. Select Close. All the lines containing href links will now be highlighted.

  7. Select Search > Bookmark > Remove Unmarked Lines. This will remove all but the highlighted (bookmarked) lines.

  8. You'll still have extraneous markup to remove. To begin removing all text except for the URLs, press Ctrl-H to open the Replace dialog.

  9. Make sure Regular expression is selected.

  10. For "Find what", enter: ^.*href="(.*?)".*?>.*?$

  11. For "Replace with", enter: $1

  12. Select Replace All.

You should now have a clean list of URLs.
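
If you'd rather script it than click through the dialogs, the same two-pass idea can be roughly sketched in Python. This is only an illustration, not part of the original superuser.com answer: the function name extract_urls and the sample markup are made up for the example, and it assumes each link sits on its own line, just as the Notepad++ steps do.

```python
import re
from typing import List

# Pass 1 (the "Mark All" step): lines that contain an absolute http(s) link.
MARK_PATTERN = re.compile(r'^.*(?:href="http).*$')

# Pass 2 (the "Replace All" step): capture the href value and drop the rest.
EXTRACT_PATTERN = re.compile(r'^.*href="(.*?)".*?>.*?$')


def extract_urls(html: str) -> List[str]:
    """Hypothetical helper: return one URL per line that has an http(s) href."""
    urls = []
    for line in html.splitlines():
        # Equivalent to removing the unmarked lines.
        if MARK_PATTERN.search(line):
            # Equivalent to replacing the whole line with $1.
            urls.append(EXTRACT_PATTERN.sub(r"\1", line))
    return urls


if __name__ == "__main__":
    # Made-up sample markup, for illustration only.
    sample = """<ul>
<li><a href="https://example.com/one">One</a></li>
<li><a href="/relative/path">Relative</a></li>
<li><a class="ext" href="http://example.org/two" target="_blank">Two</a></li>
</ul>"""
    print("\n".join(extract_urls(sample)))
```

As with the Notepad++ version, relative links are skipped, and because the leading .* is greedy, a line holding several links yields only the last one.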
