Scraping: DNS Filter

This code takes the list of unique URLs (see Hidden Glasgow for an example) and checks which ones point to domains that no longer exist, then drops them from the to-be-checked list.

dns-filter.php
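<?php
// Input file: one URL per line.
$file_list = "non-image-links.txt";

$data = file_get_contents($file_list);
$lines = explode("\n", $data);
$host_list = array();

// Group the URLs by host so that each domain is looked up only once.
foreach($lines as $url) {
        // Strip stray whitespace, e.g. the "\r" left by Windows line endings.
        $url = trim($url);
        $url_parts = parse_url($url);
        // Blank lines and lines without a host are skipped here.
        if(!empty($url_parts['host'])) {
                $host = $url_parts['host'];
                $host_list[$host][] = $url;
        }
}

// Print every URL whose host still has an A (IPv4 address) record.
foreach($host_list as $host_name=>$url_list) {
        if(checkdnsrr($host_name, "A")) {
                foreach($url_list as $url) {
                        echo $url."\n";
                }
        } else {
                // You can report the failing domains here. Write to STDERR
                // so they stay out of the redirected URL list:
                // fwrite(STDERR, "ERROR: $host_name\n");
        }
}

?>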

I run this to dump the valid links to a new file:

php dns-filter.php > filtered-url-list.txt
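
The grouping and filtering steps can also be wrapped in a function, which makes it possible to try the logic on a few sample URLs without doing any real DNS lookups. This is only a sketch, not part of the script above: the function name filter_urls_by_dns, the $resolves callback, and the sample URLs are all made up for illustration. It also assumes that accepting a host with either an A or an AAAA record is what you want, whereas the original script checks A records only.

```php
<?php
// Hypothetical helper (not in the original script): groups URLs by host,
// then keeps only the URLs whose host passes the $resolves check.
function filter_urls_by_dns(array $urls, callable $resolves): array
{
    $host_list = [];
    foreach ($urls as $url) {
        $url = trim($url);
        // Returns null when the string has no host part.
        $host = parse_url($url, PHP_URL_HOST);
        if (!empty($host)) {
            $host_list[strtolower($host)][] = $url;
        }
    }

    $kept = [];
    foreach ($host_list as $host_name => $url_list) {
        // One lookup per host, however many URLs point at it.
        if ($resolves((string) $host_name)) {
            $kept = array_merge($kept, $url_list);
        }
    }
    return $kept;
}

// Real lookup (assumption: an A or an AAAA record counts as "still exists").
$dns_check = function (string $host): bool {
    return checkdnsrr($host, "A") || checkdnsrr($host, "AAAA");
};

// Stand-in lookup for a dry run: pretends that only example.com exists.
$fake_check = function (string $host): bool {
    return $host === "example.com";
};

$sample = [
    "http://example.com/a",
    "http://gone.invalid/page",
    "http://example.com/b",
    "not a url",
];

// Keeps the two example.com URLs and drops the other two lines.
print_r(filter_urls_by_dns($sample, $fake_check));
```

Passing $dns_check instead of $fake_check performs the real lookups. Keeping the resolver as a parameter is just one way to make the filter easy to test.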