Determining the type of data pointed to by a Url

Article
12/07/2004

Have you ever wanted to know what type of file a given URL points to before clicking the link? Maybe you are writing an application that needs to filter out certain types of links. A web crawler is a good example: it needs to skip links to graphics, audio, zip files, and so on.

To check the type of data a URL points to, you need to issue a request to the server. Normally, this involves receiving the entire page or file at that location. This can be time-consuming, especially over slow network connections, and defeats the purpose of letting your application filter out undesired links.

The solution is to ask the server for the HTTP headers only. This "HEAD" request is small and fast (the server does not transfer the file contents) and provides exactly the data your application needs. While there are plenty of interesting headers, the one we are interested in today is "Content-Type".

Below is a simple console application that takes a URL and displays the value of the Content-Type header. Please note: to keep this example as small as possible, only minimal error checking is performed. Any real-world implementation would need to do much more than what I show here.

using System;
using System.Net;

class ContentType
{
    public static void Main(String[] args)
    {
        if(args.Length != 1)
        {
            Console.WriteLine("Please specify a url path.");
            return;
        }

        // display the content type for the url
        String url = args[0];
        Console.WriteLine(String.Format("Url : {0}", url));
        Console.WriteLine(String.Format("Type: {0}", GetContentType(url)));
    }

    private static String GetContentType(String url)
    {
        HttpWebResponse response = null;
        String contentType = "";

        try
        {
            // create the request; the cast yields null for non-HTTP
            // schemes that WebRequest.Create supports (ex: file)
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            if(request == null)
            {
                return "Unsupported Uri";
            }

            // instruct the server to return headers only
            request.Method = "HEAD";

            // make the connection
            response = request.GetResponse() as HttpWebResponse;

            // read the headers
            WebHeaderCollection headers = response.Headers;

            // get the content type
            contentType = headers["Content-Type"];
        }
        catch(WebException e)
        {
            // we encountered a problem making the request
            // (not found (404), unauthorized (401), server unreachable, etc)
            response = e.Response as HttpWebResponse;

            // return the message from the exception
            contentType = e.Message;
        }
        catch(NotSupportedException)
        {
            // this will be caught if WebRequest.Create encounters a uri
            // that it does not support (ex: mailto)

            // return a friendly error message
            contentType = "Unsupported Uri";
        }
        catch(UriFormatException)
        {
            // the url is not a valid uri

            // return a friendly error message
            contentType = "Malformed Uri";
        }
        finally
        {
            // make sure the response gets closed
            // this avoids leaking connections
            if(response != null)
            {
                response.Close();
            }
        }

        return contentType;
    }
}

The above code can be compiled and run on either the .NET Framework or the .NET Compact Framework.

Here's a sampling of the content types I received when running the above application against a handful of URLs:

text/html
text/html; charset=utf-8
image/gif
application/octet-stream
text/plain
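
As the samples show, the header value can carry parameters such as a charset after the media type, so a link filter usually wants to compare only the part before the semicolon. Below is a minimal sketch of that step. The class name ContentTypeFilter, its two methods, and the "follow only text/*" policy are hypothetical illustrations, not part of the application above.

```csharp
using System;

static class ContentTypeFilter
{
    // Returns the media type portion of a Content-Type value, lowercased.
    // Parameters such as "; charset=utf-8" are dropped.
    public static string GetMediaType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return "";
        }

        int semicolon = contentType.IndexOf(';');
        string mediaType = semicolon >= 0
            ? contentType.Substring(0, semicolon)
            : contentType;

        return mediaType.Trim().ToLowerInvariant();
    }

    // Assumed crawler policy: follow only text-based content.
    public static bool ShouldFollow(string contentType)
    {
        return GetMediaType(contentType).StartsWith("text/", StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(ShouldFollow("text/html; charset=utf-8")); // True
        Console.WriteLine(ShouldFollow("image/gif"));                // False
    }
}
```

A crawler could pass the string returned by GetContentType to ShouldFollow and skip the link when the result is false. Note that error strings such as "Malformed Uri" would also be rejected by this policy, which may or may not be what a real application wants.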

Enjoy!
-- DK

Disclaimer(s):
This posting is provided "AS IS" with no warranties, and confers no rights.

Comments

Anonymous
December 07, 2004
Hmmm... I think your error checking is excellent as-is, and much better than what is normally seen in real-world scenarios. How would you recommend improving it?
Anonymous
December 08, 2004
Thanks for the compliment, G. Man. :)

One possible improvement would be in the WebException handler. There are times when it may be appropriate to handle a 404 response differently from a 502, for example.
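
To illustrate the 404 versus 502 distinction, here is a rough sketch of how a handler might branch on the status code. The StatusHandling class, the LinkAction enum, and the choice of which codes map to which action are all assumptions made for this example; a real application would pick its own policy.

```csharp
using System;
using System.Net;

static class StatusHandling
{
    // Hypothetical outcomes a crawler might care about.
    public enum LinkAction { Drop, RetryLater, NeedsCredentials }

    public static LinkAction Classify(HttpStatusCode status)
    {
        switch (status)
        {
            case HttpStatusCode.NotFound:           // 404
            case HttpStatusCode.Gone:               // 410
                return LinkAction.Drop;

            case HttpStatusCode.Unauthorized:       // 401
            case HttpStatusCode.Forbidden:          // 403
                return LinkAction.NeedsCredentials;

            case HttpStatusCode.BadGateway:         // 502
            case HttpStatusCode.ServiceUnavailable: // 503
            case HttpStatusCode.GatewayTimeout:     // 504
                return LinkAction.RetryLater;

            default:
                // assumed fallback for anything not listed above
                return LinkAction.Drop;
        }
    }

    // e.Response is null when no HTTP response was received at all
    // (for example a DNS failure or a timeout). This method does not
    // close the response; the caller remains responsible for that.
    public static LinkAction Classify(WebException e)
    {
        HttpWebResponse response = e.Response as HttpWebResponse;
        if (response == null)
        {
            return LinkAction.RetryLater;
        }
        return Classify(response.StatusCode);
    }

    public static void Main()
    {
        Console.WriteLine(Classify(HttpStatusCode.NotFound));   // Drop
        Console.WriteLine(Classify(HttpStatusCode.BadGateway)); // RetryLater
    }
}
```

In the WebException handler, the result of Classify(e) could decide whether the link is discarded or queued for another attempt, instead of returning e.Message in every case.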

My main concern is that readers do not conclude that the exceptions I chose to handle (or how I decided to handle them) are the only ones that are needed or appropriate.

-- DK
