URL::isWellFormed() not totally implemented

I think I have a properly working solution using std::regex, intended to cover just the basic http+https cases. I’ve only tested this on Windows, using VS2015 and VS2017.

#include <regex> //You'll have to put this somewhere proper...

bool URL::isWellFormed() const
{
    if (url.isNotEmpty())
    {
        const auto s = toString (true).toStdString();
        // Built once and reused; constructing a std::regex on every call is comparatively expensive.
        static const std::regex r ("^(?:http(s)?:\\/\\/)?[\\w.-]+(?:\\.[\\w\\.-]+)+[\\w\\-\\._~:/?#[\\]@!\\$&'\\(\\)\\*\\+,;=.]+$");
        return std::regex_match (s, r);
    }

    return false;
}

Some unit tests:


//==============================================================================
#if JUCE_UNIT_TESTS

struct URLTests  : public UnitTest
{
    URLTests()
        : UnitTest ("URL", "URL")
    {}

    void runTest() override
    {
        beginTest ("Well Formed URLs");
        {
            const StringArray wellFormed =
            {
                "https://www.example.com",
                "http://www.example.com",
                "www.example.com",
                "example.com",
                "http://blog.example.com",
                "http://www.example.com/product",
                "http://www.example.com/products?id=1&page=2",
                "http://www.example.com#up",
                "http://255.255.255.255",
                "255.255.255.255",
                "http://valid.com/perl.cgi?key=", //Value doesn't have to be specified according to RFC
                "http://web-site.com/cgi-bin/perl.cgi?key1=value1&key2",
                "http://www.site.com:8008"
            };

            for (const auto& wf : wellFormed)
                expect (URL (wf).isWellFormed());
        }

        beginTest ("Broken URLs");
        {
            const StringArray broken =
            {
                " \t!@<>#%;|/?:&=+$,",  //Reserved and unsafe chars
                "{}|\\^[]`",            //Unwise chars, as per RFC standard
                "http://blog.sergeys.us/beer?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)&amp;utm_content=Google+Reader"
            };

            for (const auto& b : broken)
                expect (! URL (b).isWellFormed());
        }
    }
};

static URLTests urlTests;

#endif

What about other schemes besides http and https?

By all means, feel free to contribute and grow the regex! It’s just a starting point that covers the typical website cases.

It’s a lot harder than you might think. Here’s a website I found that compares different regexes for validity:

https://mathiasbynens.be/demo/url-regex

The one that works the best (and still fails one valid URL) is:

^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$

I’m not sure a regex is the best way to go here.

Oh, no no - I definitely realise there are many cases that require something much better than my solution… assuming you want to go crazy with a fully RFC-compliant thing.

Now I can’t speak for you, but I’d rather have something inch closer towards testing validity with a solution that covers most use-cases in JUCE’s context… It’s much better than just “hey if this string isn’t empty, it’s fine.” The regex I provided does just that.

Unfortunately that wouldn’t cover most of the registered URI schemes (see the IANA registry, “Uniform Resource Identifier (URI) Schemes”).

If you have a better idea then by all means provide one.

Why are you so aggressive? I’m merely pointing out some major flaws in your solution.

Right now the URL::isWellFormed () is too lenient (because it’s not really doing anything) and thus everything gets accepted.

Your solution is too restrictive and would actually break existing code. If you create a URL object with a juce::File as its constructor argument, your code would break that, since it doesn’t accept “file://” as a valid scheme.

No, I won’t provide a fully working solution for you: I have no desire to fix this particular function, since it doesn’t affect my current projects.

I’m merely trying to help, to avoid problems.

There are so many permutations, new schemes, changing rules (depending on the scheme) that I truly think that as long as the string isn’t empty, you should just accept it as valid.

Sorry, that wasn’t my intention! I was just trying to say that the door is open for better alternatives. You sounded interested, and you seem to know of many other situations my solution would break, so I was expecting you to outline at least some of them beyond the links you shared.

Interesting. Well that should be relatively easy to fix in the regex, assuming that’s even a valid path forward.

I mean, that’s not really fair at all. I’m setting up multiple network driven apps and am trying to use JUCE’s URL convention to catch developer issues - which the solution provided did a few times now.

Then my advice is to separate the scheme and the rest of the URL. Depending on the scheme (http and https) you do your regex, because http and https are well known schemes with well known restrictions you can test for. For everything else you simply accept a non-empty string as valid, because you can’t possibly know and test for all possible permutations or even future schemes.
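
For what it’s worth, here is a rough sketch of that scheme-splitting idea using only the standard library, so it can be tried outside JUCE. Everything in it is an assumption made for illustration: extractScheme and isWellFormedSketch are hypothetical names, not JUCE API, and the http/https regex is a simplified stand-in rather than the one posted above. The scheme pattern also deliberately leaves out '.', which RFC 3986 does allow in scheme names, so that something like “www.site.com:8008” isn’t mistaken for a scheme.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: returns the lower-cased scheme if the string starts with
// "<scheme>:", or an empty string otherwise. '.' is intentionally not accepted
// in the scheme (an assumption of this sketch) to avoid treating host:port as one.
static std::string extractScheme (const std::string& url)
{
    static const std::regex schemeRegex ("^([A-Za-z][A-Za-z0-9+\\-]*):");
    std::smatch m;

    if (! std::regex_search (url, m, schemeRegex))
        return {};

    std::string scheme = m[1].str();
    std::transform (scheme.begin(), scheme.end(), scheme.begin(),
                    [] (unsigned char c) { return static_cast<char> (std::tolower (c)); });
    return scheme;
}

// Hypothetical free-function version of the check: strict for http/https (and
// for strings with no scheme at all), lenient for every other scheme.
static bool isWellFormedSketch (const std::string& url)
{
    if (url.empty())
        return false;

    const auto scheme = extractScheme (url);

    if (scheme.empty() || scheme == "http" || scheme == "https")
    {
        // Simplified stand-in pattern: optional scheme, dotted host,
        // optional port, optional path/query/fragment without whitespace.
        static const std::regex httpRegex (
            "^(?:https?://)?[\\w\\-]+(?:\\.[\\w\\-]+)+(?::\\d+)?(?:[/?#]\\S*)?$");
        return std::regex_match (url, httpRegex);
    }

    // Unknown or future scheme (file, ftp, mailto, ...): any non-empty string is accepted.
    return true;
}
```

With this split, “file:///home/user/sample.wav” and “mailto:someone@example.com” are accepted because their schemes are not checked further, while “http://” on its own or a string of unwise characters is still rejected. One side effect of being lenient is that “localhost:8080” gets read as an unknown scheme and accepted, which may or may not be what you want.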