URL Regex Pattern


Found in the Nutch source: a regex pattern for matching URL strings in text. It is used with Perl5Compiler in Jakarta ORO.

"([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)"

This expression is almost as long as Gilda's death scene in Rigoletto.
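For anyone who wants to try the pattern without setting up Jakarta ORO, here is a minimal sketch using Python's `re` module instead. This is a swapped-in engine, not what Nutch does. It assumes the surrounding double quotes are just string delimiters and drops them. The `find_urls` helper and the sample text are made up for illustration.

```python
import re

# The pattern from the post, minus the surrounding double quotes
# (assumed here to be string-literal delimiters, not part of the regex).
URL_PATTERN = re.compile(
    r"([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)"
)


def find_urls(text):
    """Return every substring of text that the pattern matches (hypothetical helper)."""
    # finditer + group(0) yields whole matches; findall would return
    # tuples here because the pattern contains capture groups.
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = (
    "See http://lucene.apache.org/nutch/ and "
    "ftp://example.org/pub/file%20name.txt#top for details."
)
print(find_urls(sample))
# ['http://lucene.apache.org/nutch/', 'ftp://example.org/pub/file%20name.txt#top']
```

One thing to watch for: `.` and `,` are both in the allowed character class, so a URL followed directly by sentence punctuation will pick up that trailing character.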

[Update]
A bug in Firefox (or my CSS) truncates the pattern above. Here it is in four pieces; there are no spaces between them.

"([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]
(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2})
{1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:
@&~=%-]{0,1000}))?)"
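If you are copying the four pieces by hand, a quick way to check that nothing got lost is to join them and compile the result. The sketch below does that in Python (again a stand-in for Jakarta ORO, with the outer double quotes dropped on the assumption that they are string delimiters). The names `PIECES` and `FULL` and the test URL are illustrative only.

```python
import re

# The four pieces as listed above, without the outer double quotes.
PIECES = [
    r"([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]",
    r"(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2})",
    r"{1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:",
    r"@&~=%-]{0,1000}))?)",
]

# Join with the empty string: no spaces between the pieces.
FULL = "".join(PIECES)

# Compiling raises re.error if a piece was mangled badly enough
# to unbalance a bracket or parenthesis.
pattern = re.compile(FULL)

# A whole-string match on a made-up URL as a sanity check.
print(bool(pattern.fullmatch("http://example.org/a?b=1#frag")))
```

A stray space inside a piece would still compile, so comparing `FULL` with the single-line version is the stronger check.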
