This shows you the differences between two versions of the page.
scrape:hidden_glasgow [2020/11/20 17:38] admin |
scrape:hidden_glasgow [2020/11/20 18:10] (current) admin |
||
---|---|---|---|
Line 35: | Line 35: | ||
grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt | grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt | ||
</code> | </code> | ||
- | 22k links left. Now I'll work through them. First I'll run them through a script that finds which domains no longer exist and removes them. Then it'll search for 404s and move those into a separate list. | + | 22k links left. Now I'll work through them. First I'll run them through a [[Scraping/DNS filter]] script that finds which domains no longer exist and removes them. Then it'll search for 404s and move those into a separate list. |
- | Check out the [[Scraping]] page for some of the tools I use. | + | Check out the [[/Scraping]] page for more information. |
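The DNS and 404 pass described above could look roughly like the sketch below. This is not the actual DNS filter script from the wiki, only an illustration under some assumptions: `getent` and `curl` are installed, the input is one URL per line (such as non-image-links.txt), and the function names and the output files dead-domains.txt, 404-links.txt and live-links.txt are made up for this example.

```shell
#!/bin/sh
# Hypothetical sketch of a two-stage link filter (names are illustrative,
# not taken from the original script).

# Print the host part of a URL: drop scheme, path/query/fragment,
# userinfo and port, in that order.
url_host() {
    printf '%s\n' "$1" |
        sed -E 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||; s|[/?#].*$||; s|^[^@]*@||; s|:[0-9]+$||'
}

# Succeed if the host still resolves. Assumes getent is available
# (glibc-based systems); swap in another resolver check if it is not.
host_resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Print the final HTTP status code for a URL using a HEAD request,
# following redirects. curl reports 000 when no response was received.
http_status() {
    curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$1"
}

# Sort every URL in the file "$1" into one of three lists
# in the current directory.
sort_links() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        if ! host_resolves "$(url_host "$url")"; then
            printf '%s\n' "$url" >> dead-domains.txt   # domain is gone
        elif [ "$(http_status "$url")" = "404" ]; then
            printf '%s\n' "$url" >> 404-links.txt      # page is gone
        else
            printf '%s\n' "$url" >> live-links.txt     # everything else
        fi
    done < "$1"
}
```

Running `sort_links non-image-links.txt` would then split the list. Some servers answer HEAD requests differently from GET, so anything landing in 404-links.txt is worth spot-checking before it is discarded.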