This shows you the differences between two versions of the page.
scrape:hidden_glasgow [2020/11/20 17:22] admin created |
scrape:hidden_glasgow [2020/11/20 18:10] (current) admin |
||
---|---|---|---|
Line 33: | Line 33: | ||
Down to 62.5k links. I'm not interested in the images at the moment; I may deal with them later: | Down to 62.5k links. I'm not interested in the images at the moment; I may deal with them later: | ||
<code bash> | <code bash> | ||
- | grep -vi -E "(gif|jpg)$" http_links.txt > non-image-links.txt | + | grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt |
</code> | </code> | ||
- | 22,156 links left. Now I'll work through them. First I'll run them through a script that finds which domains no longer exist and removes them. Then it'll search for 404s and move those into a separate list. | + | 22k links left. Now I'll work through them. First I'll run them through a [[Scraping/DNS filter]] script that finds which domains no longer exist and removes them. Then it'll search for 404s and move those into a separate list. |
- | Check out the [[Scraping]] page for some of the tools I use. | + | Check out the [[/Scraping]] page for more information. |
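The two filtering passes described above (drop links whose domain no longer resolves, then split off the 404s) could look roughly like the sketch below. This is not the actual [[Scraping/DNS filter]] script, just a minimal illustration under some assumptions: the function names and every file name except non-image-links.txt are made up, DNS is checked with ''getent hosts'' (glibc systems only), and the status code comes from a ''curl'' HEAD request, which some servers refuse.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the two passes; names are illustrative only.

# Print the lower-cased host of each URL on stdin, de-duplicated.
extract_hosts() {
  sed -E 's|^[a-zA-Z]+://||; s|[/:?#].*$||' | tr '[:upper:]' '[:lower:]' | sort -u
}

# Succeed if the host still resolves (assumes getent is available).
host_resolves() {
  getent hosts "$1" > /dev/null
}

# Print the final HTTP status for a URL; curl prints 000 if the request fails.
http_status() {
  curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$1"
}

# Pass 1: URLs from $1 whose domain resolves go to $2, the rest to $3.
filter_dead_domains() {
  local url host
  : > "$2"; : > "$3"
  while IFS= read -r url; do
    host=$(printf '%s\n' "$url" | extract_hosts)
    if host_resolves "$host"; then
      printf '%s\n' "$url" >> "$2"
    else
      printf '%s\n' "$url" >> "$3"
    fi
  done < "$1"
}

# Pass 2: URLs from $1 that answer 404 go to $3, everything else to $2.
split_404s() {
  local url
  : > "$2"; : > "$3"
  while IFS= read -r url; do
    if [ "$(http_status "$url")" = "404" ]; then
      printf '%s\n' "$url" >> "$3"
    else
      printf '%s\n' "$url" >> "$2"
    fi
  done < "$1"
}
```

After sourcing the file, a run might be ''filter_dead_domains non-image-links.txt live-links.txt dead-domain-links.txt'' followed by ''split_404s live-links.txt ok-links.txt 404-links.txt''. With around 22k links this does one DNS lookup per link; resolving each unique host once (via ''extract_hosts'') and caching the result would be faster.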