URL class doesn't provide standard fields


#1

JUCE::URL doesn’t have a way to parse out the standard fields. It provides only scheme, host (called “domain”), path (called “subPath”), and query (called “parameters”); for anything else, the only way to get it is to call toString and then use some other parsing library.

Since the only thing I need is the port (actually, I need either host:port or domain:port, so I can generate cross-domain scripting rules for a given URL, but with the port and what’s already there, I can do the rest easily), I’ve just locally added a getPort method, implemented as:

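    int start = findStartOfDomain (url);
    while (url[start] == '/')
        ++start;

    // The host:port part ends at the next '/', or at the end of the string if there is no path.
    int end1 = url.indexOfChar (start, '/');
    if (end1 == -1)
        end1 = url.length();

    const int end2 = url.indexOfChar (start, ':');

    // No ':' before the end of the host part means no explicit port.
    if (end2 == -1 || end2 >= end1) return String::empty;
    return url.substring (end2 + 1, end1);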

However, ideally, a URL class should provide the complete set of URL components. Also, it would be better to name them correctly (using “parameters” to mean “query” is especially confusing, because there’s a different field with that name).

It might be simpler to just look at widely-used existing URL parsers for a design. They all cut across the components at different depths (e.g., return the whole net_loc as a string, or break it down into userinfo and hostport, or all the way down to user, password, host, and port), and provide different sets of other extras (e.g., provide path-style APIs, or return the path as an object that provides them, or give params and query as a list of name-value pairs), and if you just clone a popular API, you don’t need to put too much thought into it. Some examples worth looking at: Python urlparse, Cocoa NSURL, JavaScript parseUri, Perl URI::Split, C++ cpp-netlib (see basic_uri).

If you want to design it from scratch, URLs are defined by RFC 1738 and RFC 1808 (and URIs by RFC 2396), and HTTP URLs in particular by RFC 2616 (section 3.2.2). For a typical absolute URL (in the “common Internet scheme” or “generic URI” format) like this:

    http://user:pass@host.domain.com:8000/path/to/resource;param=pval?q1=v1&q2=v2#frag

… you can define fields:
- scheme = “http” // aka protocol
- net_loc = “user:pass@host.domain.com:8000” // aka authority
  - userinfo = “user:pass”
    - user = “user”
    - password = “pass”
  - hostport = “host.domain.com:8000”
    - host = “host.domain.com”
      - domain = “domain.com”
    - port = “8000”
- path = “/path/to/resource”
- params = “param=pval”
- query = “q1=v1&q2=v2”
- fragment = “frag” // aka anchor
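
To make the top level of that breakdown concrete, here is a rough sketch in plain standard C++ (no JUCE types) that peels the fields off from the outside in. The names `UrlParts` and `splitUrl` are made up for illustration and are not part of JUCE or any other library. It assumes a well-formed URL in the generic format and does no validation or percent-decoding.

```cpp
#include <string>

// Hypothetical holder for the top-level fields listed above; not a JUCE type.
struct UrlParts
{
    std::string scheme, netLoc, path, params, query, fragment;
};

// Splits a generic-format URL from the outside in:
// fragment, then scheme, then net_loc, then query, then params; the rest is the path.
inline UrlParts splitUrl (std::string url)
{
    const auto npos = std::string::npos;
    UrlParts p;

    // Everything after the first '#' is the fragment.
    auto pos = url.find ('#');
    if (pos != npos)
    {
        p.fragment = url.substr (pos + 1);
        url.erase (pos);
    }

    // The scheme runs up to the first ':', provided no '/' or '?' comes before it.
    pos = url.find (':');
    if (pos != npos && url.find_first_of ("/?") > pos)
    {
        p.scheme = url.substr (0, pos);
        url.erase (0, pos + 1);
    }

    // A leading "//" introduces the net_loc, which ends at the next '/' or '?'.
    if (url.compare (0, 2, "//") == 0)
    {
        const auto end = url.find_first_of ("/?", 2);
        p.netLoc = url.substr (2, end == npos ? npos : end - 2);
        url.erase (0, end); // when end is npos this clears the whole string
    }

    // The query is everything after the first '?'.
    pos = url.find ('?');
    if (pos != npos)
    {
        p.query = url.substr (pos + 1);
        url.erase (pos);
    }

    // Params follow the first ';' in what remains (the RFC 1808 layout).
    pos = url.find (';');
    if (pos != npos)
    {
        p.params = url.substr (pos + 1);
        url.erase (pos);
    }

    p.path = url;
    return p;
}
```

Calling `splitUrl` on the example URL should give back the six top-level fields from the list; breaking net_loc down further into userinfo, host, and port is a separate step.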


#2

Well, I admit I never really intended the class to be a full-blown standards-compliant URL class - originally it was little more than just a holder for the string, but it grew a few extra features over the years when they were needed, without me ever sitting down and reading up on the official standard or terminology!

A getPort method would be a good idea - thanks for the hint there, I’ll see if I can slip that in there…


#3

And… it’s a big fail.
A URL can hold anything.
Read the second sentence again, please.
Your method will fail as soon as someone uses an IPv6 address (you know, the kind that reads “2334:afce::”); it’ll also fail if someone uses a user:pass combo (like in http://user:pass@someaddress.com) to enter the credentials automatically.
It would also fail if it’s not an HTTP URL but a mail address (mailto:anything@anything), or a urn:uuid (urn:uuid:abczuihec.soidzje:23421), and so on.
Writing a full-blown, compliant URL parsing class is hard and, worse, there are ambiguities unless you have a DNS class ready to resolve them.
Typically, to get the port out of the URL’s authority, you must pass the authority itself to a getaddrinfo function call.
That is the only sane and reliable way (think about users entering a URL: they might not enter the scheme and expect you to guess it).
I’ve implemented a URL parser for my company and have bitten the dust multiple times, so really, delegate this to a common library; you’ll get safer and faster results.
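
For what it's worth, two of the failure cases above (a user:pass prefix and an IPv6 host) can be handled when the authority is already isolated and the IPv6 address is written in brackets, as in `[2334:afce::]:8080`. The following is only a sketch in standard C++. `Authority` and `splitAuthority` are hypothetical names, not JUCE API. It does nothing for an unbracketed IPv6 address, which stays ambiguous, or for schemes like mailto: and urn: that have no authority at all.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; field names follow the list earlier in the thread.
struct Authority
{
    std::string userinfo, host, port;
};

// Splits "user:pass@host:port" into its pieces, tolerating a bracketed IPv6 host.
inline Authority splitAuthority (const std::string& netLoc)
{
    Authority a;
    std::string hostport = netLoc;

    // userinfo is everything before the last '@', if there is one.
    const auto at = netLoc.rfind ('@');
    if (at != std::string::npos)
    {
        a.userinfo = netLoc.substr (0, at);
        hostport = netLoc.substr (at + 1);
    }

    // For a bracketed IPv6 literal, only a ':' after the closing ']' can start the port.
    std::size_t searchFrom = 0;
    if (! hostport.empty() && hostport[0] == '[')
    {
        const auto close = hostport.find (']');
        if (close == std::string::npos)
        {
            a.host = hostport; // malformed literal: do not guess at a port
            return a;
        }
        searchFrom = close;
    }

    // Without a ':' the whole remainder is the host and the port stays empty.
    const auto colon = hostport.find (':', searchFrom);
    a.host = hostport.substr (0, colon);
    if (colon != std::string::npos)
        a.port = hostport.substr (colon + 1);

    return a;
}
```

The port comes back as a string, so an empty result means "no explicit port" and the caller still has to pick the scheme's default. None of this replaces a real parsing library; it only shows why searching for the first ':' after the scheme is not enough.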