URL::isWellFormed() not totally implemented

jrlanglois · June 12, 2019, 7:02pm

I think I have a properly working solution using std::regex, intended to cover just the basic http+https cases. I’ve only tested this on Windows, using VS2015 and VS2017.

#include <regex> //You'll have to put this somewhere proper...

bool URL::isWellFormed() const
{
    if (url.isNotEmpty())
    {
        const auto s = toString (true).toStdString();
        // Compile the pattern once; building a std::regex on every call is expensive.
        static const std::regex r ("^(?:http(s)?:\\/\\/)?[\\w.-]+(?:\\.[\\w\\.-]+)+[\\w\\-\\._~:/?#[\\]@!\\$&'\\(\\)\\*\\+,;=.]+$");
        return std::regex_match (s, r);
    }

    return false;
}

Some unit tests:


//==============================================================================
#if JUCE_UNIT_TESTS

struct URLTests  : public UnitTest
{
    URLTests()
        : UnitTest ("URL", "URL")
    {}

    void runTest() override
    {
        beginTest ("Well Formed URLs");
        {
            const StringArray wellFormed =
            {
                "https://www.example.com",
                "http://www.example.com",
                "www.example.com",
                "example.com",
                "http://blog.example.com",
                "http://www.example.com/product",
                "http://www.example.com/products?id=1&page=2",
                "http://www.example.com#up",
                "http://255.255.255.255",
                "255.255.255.255",
                "http://valid.com/perl.cgi?key=", //Value doesn't have to be specified according to RFC
                "http://web-site.com/cgi-bin/perl.cgi?key1=value1&key2",
                "http://www.site.com:8008"
            };

            for (const auto& wf : wellFormed)
                expect (URL (wf).isWellFormed());
        }

        beginTest ("Broken URLs");
        {
            const StringArray broken =
            {
                " \t!@<>#%;|/?:&=+$,",  //Reserved and unsafe chars
                "{}|\\^[]`",            //Unwise chars, as per RFC standard
                "http://blog.sergeys.us/beer?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)&amp;utm_content=Google+Reader"
            };

            for (const auto& b : broken)
                expect (! URL (b).isWellFormed());
        }
    }
};

static URLTests urlTests;

#endif

Toddler-Boy · June 12, 2019, 7:38pm

What about other schemes besides http and https?

jrlanglois · June 12, 2019, 7:39pm

By all means, feel free to contribute and grow the regex! It’s just a starting point that covers the typical website cases.

Toddler-Boy · June 12, 2019, 7:45pm

It’s a lot harder than you might think. Here’s a website I found that compares different regexes for validity:

https://mathiasbynens.be/demo/url-regex

The one that works the best (and still fails one valid URL) is:

^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$

I’m not sure a regex is the best way to go here.

jrlanglois · June 12, 2019, 7:49pm

Oh, no no - I definitely realise there are many cases that require something much better than my solution… assuming you want to go crazy with a fully RFC-compliant thing.

Now I can’t speak for you, but I’d rather have something inch closer towards testing validity with a solution that covers most use-cases in JUCE’s context… It’s much better than just “hey if this string isn’t empty, it’s fine.” The regex I provided does just that.

Unfortunately that wouldn’t cover most cases from the registered URI schemes: Uniform Resource Identifier (URI) Schemes

If you have a better idea then by all means provide one.

Toddler-Boy · June 12, 2019, 8:12pm

Why are you so aggressive? I’m merely pointing out some major flaws in your solution.

Right now the URL::isWellFormed () is too lenient (because it’s not really doing anything) and thus everything gets accepted.

Your solution is too restrictive and would actually break existing code. If you create a URL object with a juce::File as its constructor argument, your code would break that, since it doesn’t accept “file://” as a valid scheme.
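To make the “file://” point concrete, here is a minimal standalone sketch that runs the regex from the first post against plain std::string inputs. It deliberately avoids JUCE so it can be compiled on its own; the function name matchesProposedRegex and the file path are hypothetical, and the exact string a juce::File-based URL produces is assumed here to start with “file://”.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the regex proposed in the first post to a raw string.
static bool matchesProposedRegex (const std::string& candidate)
{
    // Same pattern as in the proposed URL::isWellFormed(), compiled once.
    static const std::regex pattern ("^(?:http(s)?:\\/\\/)?[\\w.-]+(?:\\.[\\w\\.-]+)+[\\w\\-\\._~:/?#[\\]@!\\$&'\\(\\)\\*\\+,;=.]+$");
    return std::regex_match (candidate, pattern);
}
```

With this helper, matchesProposedRegex ("https://www.example.com") succeeds, while matchesProposedRegex ("file:///home/user/sample.wav") fails: the pattern only knows the optional http/https prefix, so “file” is read as the start of a host name and the match stops at the colon.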

No, I won’t provide a fully working solution for you: I have no desire to fix this particular function, since it doesn’t affect my current projects.

I’m merely trying to help, to avoid problems.

There are so many permutations, new schemes, changing rules (depending on the scheme) that I truly think that as long as the string isn’t empty, you should just accept it as valid.

jrlanglois · June 12, 2019, 8:19pm

Sorry, that wasn’t my intention! I’m just trying to indicate that the door is open for better alternatives. You sounded interested, and you seem aware of many other situations that my solution would break, so I was expecting you to outline those, at least to some extent beyond the links you shared.

Interesting. Well that should be relatively easy to fix in the regex, assuming that’s even a valid path forward.

I mean, that’s not really fair at all. I’m setting up multiple network driven apps and am trying to use JUCE’s URL convention to catch developer issues - which the solution provided did a few times now.

Toddler-Boy · June 12, 2019, 8:28pm

Then my advice is to separate the scheme and the rest of the URL. Depending on the scheme (http and https) you do your regex, because http and https are well known schemes with well known restrictions you can test for. For everything else you simply accept a non-empty string as valid, because you can’t possibly know and test for all possible permutations or even future schemes.
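The advice above could be sketched roughly as follows. This is only an illustration under stated assumptions, not JUCE code: it works on std::string, the names extractScheme and isWellFormedSketch are hypothetical, the scheme comparison is case-sensitive for simplicity, and a string with no “://” at all is treated as “everything else” and accepted when non-empty.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text before "://", or an empty string if there is none.
static std::string extractScheme (const std::string& url)
{
    const auto pos = url.find ("://");
    return pos == std::string::npos ? std::string() : url.substr (0, pos);
}

// Hypothetical sketch: strict check for http/https, lenient check for anything else.
static bool isWellFormedSketch (const std::string& url)
{
    if (url.empty())
        return false;

    const auto scheme = extractScheme (url);

    if (scheme == "http" || scheme == "https")
    {
        // Regex from the first post, reused only for the schemes it was written for.
        static const std::regex webUrl ("^(?:http(s)?:\\/\\/)?[\\w.-]+(?:\\.[\\w\\.-]+)+[\\w\\-\\._~:/?#[\\]@!\\$&'\\(\\)\\*\\+,;=.]+$");
        return std::regex_match (url, webUrl);
    }

    // Unknown, future, or missing scheme: accept any non-empty string.
    return true;
}
```

The trade-off is visible in the fallback branch: “file:///home/user/sample.wav” is accepted again, but so is any non-empty string without an http or https prefix, including the scheme-less entries from the “Broken URLs” test.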
