What’s the problem with this regex, made to match a YouTube URL and return an ID? (From CloudBot)
(I came across this trying to break an IRC bot based on CloudBot, of course!)
Let’s follow one specific branch of the regex to keep things simple:
/(?:youtube.*?v=)([-_a-zA-Z0-9]+)/i
(This part is meant to match e.g. http://youtube.com/watch?v=$VIDEO_ID.)
Basically, the regex is very sloppy: nothing anchors it to a real hostname or to the actual v parameter, so it matches far too much. For example:
- youtube.notactuallyyoutube.com/v=dQw4w9WgXcQ (could redirect to anything)
- youtube (random text here) v=dQw4w9WgXcQ (not actually a URL)
- http://youtube.com/watch?av=$THING_ONE&v=dQw4w9WgXcQ&v=$THING_THREE (depending on how the regex is written, it shows either thing_one, the rick roll, or thing_three, but YouTube itself will always load the rick roll)
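To see the over-matching in action, here is a minimal Python sketch of the simplified pattern above. The helper name sloppy_id is made up for illustration, the character class is written as a-zA-Z, and the placeholders are written without the leading $ so that they fit the character class.

```python
import re

# Simplified pattern from above, compiled case-insensitively like the /i flag.
SLOPPY = re.compile(r"(?:youtube.*?v=)([-_a-zA-Z0-9]+)", re.I)

def sloppy_id(text):
    """Return whatever the sloppy regex captures, or None (hypothetical helper)."""
    match = SLOPPY.search(text)
    return match.group(1) if match else None

samples = [
    "http://youtube.com/watch?v=dQw4w9WgXcQ",        # the intended case
    "youtube.notactuallyyoutube.com/v=dQw4w9WgXcQ",  # wrong host
    "youtube (random text here) v=dQw4w9WgXcQ",      # not a URL at all
    # "av=" ends in "v=", so the lazy match stops there first
    "http://youtube.com/watch?av=THING_ONE&v=dQw4w9WgXcQ&v=THING_THREE",
]

for sample in samples:
    print(sample, "->", sloppy_id(sample))
```

All four inputs produce a match, and the last one yields THING_ONE rather than the video YouTube would actually load.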
So, how should it be done?
- Write a stricter regex
- Use the built-in URL parsing features (in Python it’s urlparse) to check the hostname, then use e.g. parse_qs to turn the query into a dictionary/hash, then take v out of it.
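The second option could look roughly like this in Python 3, where urlparse and parse_qs live in urllib.parse. This is only a sketch: the function name youtube_id and the host allow-list are my own assumptions, and it handles only watch-style URLs.

```python
from urllib.parse import urlparse, parse_qs

# Hostnames treated as YouTube; this allow-list is an assumption for the sketch.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}

def youtube_id(url):
    """Return the v parameter of a YouTube URL, or None (hypothetical helper)."""
    parts = urlparse(url)
    # hostname is lowercased by urlparse and is None when there is no "//host" part
    if parts.hostname not in YOUTUBE_HOSTS:
        return None
    # parse_qs maps each parameter name to a list of its values, in order
    values = parse_qs(parts.query).get("v")
    if not values:
        return None
    # Take the first v, matching the observation above that YouTube
    # loads the first v parameter.
    return values[0]

print(youtube_id("http://youtube.com/watch?av=THING_ONE&v=dQw4w9WgXcQ&v=THING_THREE"))
print(youtube_id("http://youtube.notactuallyyoutube.com/v=dQw4w9WgXcQ"))
```

Because the query string is parsed into named parameters, av and v can no longer be confused, and a lookalike hostname is rejected before the query is even examined.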
And it’s not just small projects that have these bugs; even the Reddit Enhancement Suite has one! (It shows up in this form: https://www.youtube.com/watch?v=$ACTUAL_VIDEO&av=$EXPANDO_VIDEO. It doesn’t work as a link post, and yes, I reported it.)