scrape:hidden_glasgow

Differences

This shows you the differences between two versions of the page.

scrape:hidden_glasgow [2020/11/20 17:22]
admin created
scrape:hidden_glasgow [2020/11/20 18:10] (current)
admin
Line 33:
 Down to 62.5k links. I'm not interested in the images at the moment, I'll deal with them later maybe:
 <code bash>
-grep -vi -E "(gif|jpg)$" http_links.txt > non-image-links.txt
+grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt
 </code>
-22,156 links left. Now I'll work through them. First I'll run them through a script that will find which domains no longer exist and remove them. Then it'll search for 404s and move those into a separate list.
+22k links left. Now I'll work through them. First I'll run them through a [[Scraping/DNS filter]] script that will find which domains no longer exist and remove them. Then it'll search for 404s and move those into a separate list.
  
-Check out the [[Scraping]] page for some of the tools I use.
+Check out the [[/Scraping]] page for more information.
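
The DNS filter script linked from the changed paragraph is not included in this comparison. The block below is only a rough sketch of how the two steps it describes (drop links whose domain no longer resolves, then set aside 404s) might look in bash. Every function name here is hypothetical, and it assumes `getent` and `curl` are installed and that bash 4 or later is used for the associative array.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the script from the wiki.

# Print the lowercased host part of a URL.
extract_host() {
    local url=$1
    url=${url#*://}       # drop the scheme, e.g. "http://"
    url=${url%%[/?#]*}    # drop path, query string and fragment
    url=${url##*@}        # drop "user:pass@" if present
    url=${url%%:*}        # drop ":port"
    printf '%s\n' "$url" | tr '[:upper:]' '[:lower:]'
}

# Succeed if the system resolver finds an address for the host.
# Assumes getent is available (glibc-based Linux).
host_resolves() {
    getent hosts "$1" > /dev/null 2>&1 < /dev/null
}

# Print the HTTP status code for a URL ("000" if the request fails).
# Uses a HEAD request; some servers answer HEAD differently from GET.
http_status() {
    curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$1" < /dev/null
}

# Read URLs on stdin and append each to one of three files:
#   $1 = domain does not resolve, $2 = 404, $3 = everything else.
# Each host is resolved once and the verdict is cached (bash 4+).
sort_links() {
    local dead_file=$1 notfound_file=$2 keep_file=$3 url host
    local -A resolves=()
    while IFS= read -r url; do
        host=$(extract_host "$url")
        [ -n "$host" ] || continue
        if [ -z "${resolves[$host]+set}" ]; then
            if host_resolves "$host"; then
                resolves[$host]=yes
            else
                resolves[$host]=no
            fi
        fi
        if [ "${resolves[$host]}" = no ]; then
            printf '%s\n' "$url" >> "$dead_file"
        elif [ "$(http_status "$url")" = 404 ]; then
            printf '%s\n' "$url" >> "$notfound_file"
        else
            printf '%s\n' "$url" >> "$keep_file"
        fi
    done
}
```

After sourcing the file, it could be run as `sort_links dead.txt 404.txt alive.txt < non-image-links.txt`, where the three output file names are placeholders. Caching the lookup per host matters here because many of the 22k links are likely to share a domain.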
scrape/hidden_glasgow.1605892921.txt.gz · Last modified: 2020/11/20 17:22 by admin