Handling of YouTube URLs is broken

By Puck Meerburg

What’s the problem with this regex, made to match a YouTube URL and return an ID? (From CloudBot)

/(?:youtube.*?(?:v=|/v/)|youtu\.be/|yooouuutuuube.*?id=)([-_a-zA-Z0-9]+)/i

(I came across this trying to break an IRC bot based on CloudBot, of course!)

Let’s follow one specific path to simplify the regex: /(?:youtube.*?v=)([-_a-zA-Z0-9]+)/i (this part is meant to match e.g. http://youtube.com/watch?v=$VIDEO_ID). Basically, the regex is very sloppy, so it matches far too much. For example:

  • youtube.notactuallyyoutube.com/v=dQw4w9WgXcQ (could redirect to anything)
  • youtube (random text here) v=dQw4w9WgXcQ (not actually a URL)
  • http://youtube.com/watch?av=$THING_ONE&v=dQw4w9WgXcQ&v=$THING_THREE (depending on how the regex is written, it shows either $THING_ONE, the Rickroll, or $THING_THREE, but YouTube will always load the Rickroll)
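
To see this concretely, here is a small Python sketch that runs the CloudBot pattern against inputs like the ones above. The helper name sloppy_id is made up for illustration, the /i flag is translated to re.IGNORECASE, and THING_ONE is written as a literal placeholder without the $ so that it falls inside the character class.

```python
import re

# The pattern quoted above, translated to Python (the /i flag becomes re.IGNORECASE).
SLOPPY = re.compile(
    r"(?:youtube.*?(?:v=|/v/)|youtu\.be/|yooouuutuuube.*?id=)([-_a-zA-Z0-9]+)",
    re.IGNORECASE,
)


def sloppy_id(text):
    """Return whatever the sloppy regex captures as the ID, or None (hypothetical helper)."""
    match = SLOPPY.search(text)
    return match.group(1) if match else None


candidates = [
    "youtube.notactuallyyoutube.com/v=dQw4w9WgXcQ",         # wrong host
    "youtube (random text here) v=dQw4w9WgXcQ",             # not a URL at all
    "http://youtube.com/watch?av=THING_ONE&v=dQw4w9WgXcQ",  # "v=" found inside "av="
]

for text in candidates:
    print(text, "->", sloppy_id(text))
```

The first two inputs yield dQw4w9WgXcQ even though neither is a YouTube link, and the third yields THING_ONE, because the lazy .*? stops at the first "v=" it meets, which is the tail of "av=".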

So, how should it be done?

  • Write a stricter regex
  • Use the built-in URL parsing features (in Python it’s urlparse, which lives in urllib.parse in Python 3) to check the hostname, then use e.g. parse_qs to turn the query into a dictionary/hash, and take v out of it.
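
The second option could look roughly like the sketch below, written for Python 3. It is only an illustration: the function name youtube_id and the set of accepted hostnames are assumptions, it handles only the /watch?v= form (not youtu.be or /v/ links), and it takes the first v when the parameter is repeated, matching the behaviour described above.

```python
from urllib.parse import urlparse, parse_qs

# Accepted hostnames are an assumption for this sketch; extend as needed.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def youtube_id(url):
    """Return the video ID from a /watch?v=... URL, or None if it is not one."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return None
    # hostname is lowercased by urlparse and is None when the URL has no host.
    if parts.hostname not in YOUTUBE_HOSTS:
        return None
    if parts.path != "/watch":
        return None
    # parse_qs maps each parameter name to a list of its values, in order.
    values = parse_qs(parts.query).get("v")
    return values[0] if values else None


print(youtube_id("http://youtube.com/watch?av=THING_ONE&v=dQw4w9WgXcQ&v=THING_THREE"))
print(youtube_id("http://youtube.notactuallyyoutube.com/v=dQw4w9WgXcQ"))
print(youtube_id("youtube (random text here) v=dQw4w9WgXcQ"))
```

Because the query string is parsed into named parameters, av is never mistaken for v, and anything that is not an http(s) URL on an accepted host is rejected outright.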

And it’s not just small projects that have these bugs: even the Reddit Enhancement Suite has one! (In this form: https://www.youtube.com/watch?v=$ACTUAL_VIDEO&av=$EXPANDO_VIDEO. It doesn’t work as a link post, and yes, I reported it.)