
Common Lisp/External libraries/CL-PPCRE


The Common Lisp Portable Perl-Compatible Regular Expressions library, or CL-PPCRE, brings the power of Perl regular expressions to Common Lisp. In the words of its author, Edi Weitz, CL-PPCRE has the following features:

  • It is compatible with Perl.
  • It is pretty fast.
  • It is portable between ANSI-compliant Common Lisp implementations.
  • It is thread-safe.
  • In addition to specifying regular expressions as strings like in Perl you can also use S-expressions.
  • It comes with a BSD-style license so you can basically do with it whatever you want.

Basic Usage


The main entry point to CL-PPCRE is the scan function. scan takes a regular expression (or regex) and a string to match it against, and returns four values: the start and end indices of the match, followed by two arrays holding the start and end indices of any registers (capture groups) you defined in the regex. If there is no match, scan returns NIL.

CL-USER> (scan "b(.)r" "foo bar baz bur")
4
7
#(5)
#(6)

The first and second return values are the start and end of the matching substring, respectively; the end index is exclusive, so 7 is one past the final "r". The third and fourth return values are arrays that mark the start and end indices of the register matches. Notice that scan only found the first instance, bar. To find the next instance, pass a value for the keyword parameter :start. You can likewise limit how far the scan runs along the string with the :end keyword.

;; Match the next instance of "b.r" at or after position 5
CL-USER> (scan "b.r" "foo bar baz bur" :start 5)
12
15
#()
#()

;; This fails, because there is no complete match between 5 and 13
;; (a match begins at 12, but it would have to extend to 15,
;; which is past the :end limit of 13)
CL-USER> (scan "b.r" "foo bar baz bur" :start 5 :end 13)
NIL
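
To make the bookkeeping concrete, here is a minimal sketch of collecting every match by hand with :start. The helper name match-positions is made up for this illustration and is not part of CL-PPCRE; the sketch assumes the library is already loaded (for example with Quicklisp).

```lisp
;; Assumes CL-PPCRE has been loaded, e.g. with (ql:quickload :cl-ppcre).
;; MATCH-POSITIONS is a hypothetical helper, not a CL-PPCRE function.
(defun match-positions (regex string)
  "Return a list of (START . END) pairs for every match of REGEX in STRING."
  (loop with pos = 0
        for (start end) = (multiple-value-list
                           (cl-ppcre:scan regex string :start pos))
        while start
        collect (cons start end)
        ;; Resume after this match; step one further on an empty match
        ;; so the loop always advances.
        do (setf pos (if (= start end) (1+ end) end))
        while (<= pos (length string))))

(match-positions "b.r" "foo bar baz bur")
;; => ((4 . 7) (12 . 15))
```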

As you may have noticed, keeping track of start points while scanning a string can be a bit tedious. For this, CL-PPCRE has several convenience functions and macros for common tasks such as:

  • do-scans
  • scan-to-strings
  • do-matches
  • do-register-groups
  • all-matches
  • split
  • regex-replace
  • regex-replace-all
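
A few of these in action, as a rough sketch on the same sample string used above. It assumes CL-PPCRE is loaded; all-matches-as-strings is not in the list above but is also provided by the library. Only the primary return value is shown for the replace functions.

```lisp
;; Assumes CL-PPCRE has been loaded, e.g. with (ql:quickload :cl-ppcre).
(defparameter *text* "foo bar baz bur")

;; Positions of every match as a flat list: (start1 end1 start2 end2 ...)
(cl-ppcre:all-matches "b.r" *text*)
;; => (4 7 12 15)

(cl-ppcre:all-matches-as-strings "b.r" *text*)
;; => ("bar" "bur")

;; Like scan, but returns the matched substrings instead of indices
(cl-ppcre:scan-to-strings "b(.)r" *text*)
;; => "bar", #("a")

(cl-ppcre:split "\\s+" *text*)
;; => ("foo" "bar" "baz" "bur")

(cl-ppcre:regex-replace "b.r" *text* "qux")
;; => "foo qux baz bur"

(cl-ppcre:regex-replace-all "b.r" *text* "qux")
;; => "foo qux baz qux"

;; do-register-groups binds the register contents for each match in turn
(let ((middles '()))
  (cl-ppcre:do-register-groups (m) ("b(.)r" *text*)
    (push m middles))
  (nreverse middles))
;; => ("a" "u")
```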

Regular Expressions as Trees and Closures


While CL-PPCRE takes regular expressions as strings, it actually parses each string into a regular expression tree. Besides being the Lispy thing to do, this removes much of the cryptic nature of regular expressions, although it trades it for verbosity. One nice consequence is that, since you have the regular expression's parse tree at hand, it is straightforward to build or alter regular expressions programmatically on the fly.

;;; The function cl-ppcre::parse-string (exported as
;;; cl-ppcre:parse-string in current releases) parses a
;;; regex string into a regex tree.  This allows us to see
;;; how strings translate to trees.

;; Character alternatives, Classes, wildcards
CL-USER> (cl-ppcre::parse-string "[abcdef][^abcdef]")
(:SEQUENCE (:CHAR-CLASS #\a #\b #\c #\d #\e #\f)
 (:INVERTED-CHAR-CLASS #\a #\b #\c #\d #\e #\f))
;; Note the double backslashes: the standard Lisp technique for
;; getting a literal backslash into a string
CL-USER> (cl-ppcre::parse-string ".\\d\\D\\w\\W\\s\\S")
(:SEQUENCE :EVERYTHING :DIGIT-CLASS :NON-DIGIT-CLASS :WORD-CHAR-CLASS 
 :NON-WORD-CHAR-CLASS :WHITESPACE-CHAR-CLASS :NON-WHITESPACE-CHAR-CLASS)

;; Repetitions (note that "a*?" on its own is of little use, as
;; it will always match the empty string, but it is parsed
;; correctly)
CL-USER> (cl-ppcre::parse-string "Greedy:a*b+c{2,5}d{2,}Non-greedy:a*?b+?c{2,5}?d{2,}?")
(:SEQUENCE
 "Greedy:"
 (:GREEDY-REPETITION 0 NIL #\a)
 (:GREEDY-REPETITION 1 NIL #\b)
 (:GREEDY-REPETITION 2 5 #\c)
 (:GREEDY-REPETITION 2 NIL #\d)
 "Non-greedy:"
 (:NON-GREEDY-REPETITION 0 NIL #\a)
 (:NON-GREEDY-REPETITION 1 NIL #\b)
 (:NON-GREEDY-REPETITION 2 5 #\c)
 (:NON-GREEDY-REPETITION 2 NIL #\d))

;; Alternatives
CL-USER> (cl-ppcre::parse-string "a|b|c")
(:ALTERNATION #\a #\b #\c)

;; Groups, Registers, and Back References.
CL-USER> (cl-ppcre::parse-string "(a..)+(?:def)+\\1")
(:SEQUENCE
 (:GREEDY-REPETITION 1 NIL (:REGISTER (:SEQUENCE #\a :EVERYTHING :EVERYTHING)))
 (:GREEDY-REPETITION 1 NIL (:GROUP "def")) (:BACK-REFERENCE 1))
;; This matches strings like "abcdefdefabc"
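
Since scan and the other functions accept a parse tree wherever they accept a regex string, a tree can be assembled with ordinary list operations. The following is only a sketch: any-of-words is a hypothetical helper written for this page, and it assumes it is given at least two words. Strings inside a parse tree match literally, so the words need no escaping.

```lisp
;; Assumes CL-PPCRE has been loaded, e.g. with (ql:quickload :cl-ppcre).
;; ANY-OF-WORDS is a hypothetical helper, not a CL-PPCRE function.
(defun any-of-words (words)
  "Return a parse tree matching any string in WORDS as a whole word.
Assumes WORDS holds at least two strings."
  `(:sequence :word-boundary
              (:register (:alternation ,@words))
              :word-boundary))

(any-of-words '("bar" "bur"))
;; => (:SEQUENCE :WORD-BOUNDARY (:REGISTER (:ALTERNATION "bar" "bur"))
;;     :WORD-BOUNDARY)

;; The tree is passed to the library exactly like a regex string.
(cl-ppcre:scan-to-strings (any-of-words '("bar" "bur")) "foo bar baz bur")
;; => "bar", #("bar")

(cl-ppcre:all-matches-as-strings (any-of-words '("baz" "bur"))
                                 "foo bar baz bur")
;; => ("baz" "bur")
```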

After it has the tree form for the regular expression, CL-PPCRE compiles that representation into a scanner closure using the function create-scanner (many Lisps compile down to native machine code). This has two consequences: (1) regular expression scans tend to be very fast, and (2) defining a regular expression for the first time tends to be expensive because of the compilation overhead, although there are variables you can set to reduce it. Once a scanner has been created, you can reuse it and avoid paying the compilation cost again. CL-PPCRE also uses algorithms that produce an efficient regular expression representation, i.e. not a stack-based system. Overall, it is quite an efficient library.

;; Create a scanner closure that finds matches that resemble IPs
;; and fill registers with the numbers
CL-USER> (cl-ppcre:create-scanner "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})")
#<CLOSURE (LAMBDA (STRING CL-PPCRE::START CL-PPCRE::END)) {12E62C3D}>
NIL

;; At the REPL, * holds the primary value of the previous form,
;; which here is the scanner closure created above
CL-USER> (cl-ppcre:scan-to-strings * "192.168.1.255")
"192.168.1.255"
#("192" "168" "1" "255")
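
Outside the REPL, the usual way to reuse a scanner is to store it in a variable. Here is a minimal sketch under a few assumptions: the names *ip-scanner* and parse-ipv4 are invented for this example, the regex is the one above with ^ and $ anchors added so that the whole string must match, and register-groups-bind (another CL-PPCRE macro) is used to convert the registers to integers.

```lisp
;; Assumes CL-PPCRE has been loaded, e.g. with (ql:quickload :cl-ppcre).
;; *IP-SCANNER* and PARSE-IPV4 are hypothetical names for this sketch.

;; Build the scanner once, at load time, and reuse it for every call.
(defparameter *ip-scanner*
  (cl-ppcre:create-scanner
   "^(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})$"))

(defun parse-ipv4 (string)
  "Return the four octets of STRING as a list of integers, or NIL."
  ;; Each register is passed through PARSE-INTEGER before being bound.
  (cl-ppcre:register-groups-bind ((#'parse-integer a b c d))
      (*ip-scanner* string)
    (let ((octets (list a b c d)))
      (when (every (lambda (n) (<= 0 n 255)) octets)
        octets))))

(parse-ipv4 "192.168.1.255")   ; => (192 168 1 255)
(parse-ipv4 "192.168.1.256")   ; => NIL (octet out of range)
(parse-ipv4 "not an address")  ; => NIL (no match)
```

A scanner can be passed anywhere a regex string or parse tree is accepted, so the same closure works with scan, split, regex-replace and the rest.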

Examples


Using CL-PPCRE isn't much different from using any other regular expression engine. Here we will see some quick and dirty examples of what you can do. If you get stuck on how to do something, consult regular expression tutorials, the perlre manpage, or a Perl user. Just remember to double the backslashes in your regexes (as you always have to do to insert literal backslashes into Lisp strings).
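
As a quick illustration of the backslash rule, the sketch below shows what the Lisp reader does with a doubled backslash, the equivalent S-expression form that needs no escaping, and the library's quote-meta-chars function for matching a string literally. It assumes CL-PPCRE is loaded.

```lisp
;; Assumes CL-PPCRE has been loaded, e.g. with (ql:quickload :cl-ppcre).

;; The Lisp reader turns "\\d" into the two characters \ and d.
(length "\\d")
;; => 2

(cl-ppcre:scan "\\d+" "abc123def")
;; => 3, 6, #(), #()

;; The same regex written as an S-expression needs no escaping at all.
(cl-ppcre:scan '(:greedy-repetition 1 nil :digit-class) "abc123def")
;; => 3, 6, #(), #()

;; quote-meta-chars escapes regex metacharacters in a literal string,
;; so the dot below only matches a real dot.
(cl-ppcre:scan (cl-ppcre:quote-meta-chars "1.5") "115 1.5")
;; => 4, 7, #(), #()
```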

Scanning for URLs

CL-USER> (defparameter *url-regex*  "((([A-Za-z]{3,9}:(?://)?)(?:[-;:&=+$,\\w]+@)?[A-Za-z0-9.-]+|(?:www\\.|[-;:&=+$,\\w]+@)[A-Za-z0-9.-]+)((?:/[+~%/.\\w-]*)?\\??(?:[-+=&;%@.\\w]*)#?(?:[.!/\\w]*))?)")

CL-USER> (cl-ppcre::parse-string *url-regex*)
(:REGISTER
 (:SEQUENCE
  (:REGISTER
   (:ALTERNATION
    (:SEQUENCE (:REGISTER (:SEQUENCE (:GREEDY-REPETITION 3 9 (:CHAR-CLASS (:RANGE #\A #\Z) (:RANGE #\a #\z))) #\: (:GREEDY-REPETITION 0 1 (:GROUP "//"))))
     (:GREEDY-REPETITION 0 1 (:GROUP (:SEQUENCE (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS #\- #\; #\: #\& #\= #\+ #\$ #\, :WORD-CHAR-CLASS)) #\@)))
     (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\A #\Z) (:RANGE #\a #\z) (:RANGE #\0 #\9) #\. #\-)))
    (:SEQUENCE (:GROUP (:ALTERNATION "www." (:SEQUENCE (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS #\- #\; #\: #\& #\= #\+ #\$ #\, :WORD-CHAR-CLASS)) #\@)))
     (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\A #\Z) (:RANGE #\a #\z) (:RANGE #\0 #\9) #\. #\-)))))
  (:GREEDY-REPETITION 0 1
   (:REGISTER
    (:SEQUENCE (:GREEDY-REPETITION 0 1 (:GROUP (:SEQUENCE #\/ (:GREEDY-REPETITION 0 NIL (:CHAR-CLASS #\+ #\~ #\% #\/ #\. :WORD-CHAR-CLASS #\-)))))
     (:GREEDY-REPETITION 0 1 #\?) (:GROUP (:GREEDY-REPETITION 0 NIL (:CHAR-CLASS #\- #\+ #\= #\& #\; #\% #\@ #\. :WORD-CHAR-CLASS)))
     (:GREEDY-REPETITION 0 1 #\#) (:GROUP (:GREEDY-REPETITION 0 NIL (:CHAR-CLASS #\. #\! #\/ :WORD-CHAR-CLASS))))))))

CL-USER> (cl-ppcre:scan-to-strings *url-regex* "yo mailto:will@foo.com asd http://foo.com")
"mailto:will@foo.com"
#("mailto:will@foo.com" "mailto:will@foo.com" "mailto:" "")

Finding users in /etc/passwd
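
One possible approach, offered only as a sketch: match each line of the file against a regex with one register per field of interest, and collect the results with do-register-groups. The sample data, the variable *sample-passwd*, and the function passwd-users are all made up for this illustration; the sketch works on a string in passwd format rather than reading the real file, and assumes the usual seven colon-separated fields (name, password, UID, GID, comment, home directory, shell).

```lisp
;; Assumes CL-PPCRE has been loaded, e.g. with (ql:quickload :cl-ppcre).
;; *SAMPLE-PASSWD* and PASSWD-USERS are hypothetical names for this sketch.

;; Made-up sample data in passwd format, used instead of the real file.
(defparameter *sample-passwd*
  (format nil "~{~a~%~}"
          '("root:x:0:0:root:/root:/bin/bash"
            "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin"
            "alice:x:1000:1000:Alice:/home/alice:/bin/zsh")))

(defun passwd-users (text)
  "Return a list of (NAME UID SHELL) entries, one per matching line of TEXT."
  (let ((users '()))
    ;; (?m) makes ^ and $ match at line boundaries.  The registers capture
    ;; the name, the numeric UID, and the shell; the three fields in
    ;; between (GID, comment, home directory) are skipped.
    (cl-ppcre:do-register-groups (name (#'parse-integer uid) shell)
        ("(?m)^([^:\\n]+):[^:\\n]*:(\\d+):(?:[^:\\n]*:){3}([^:\\n]*)$" text)
      (push (list name uid shell) users))
    (nreverse users)))

(passwd-users *sample-passwd*)
;; => (("root" 0 "/bin/bash")
;;     ("daemon" 1 "/usr/sbin/nologin")
;;     ("alice" 1000 "/bin/zsh"))
```

Lines that do not have this shape, such as comments or malformed entries, simply fail to match and are skipped.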


Further reading


http://www.weitz.de/cl-ppcre/ Edi Weitz's CL-PPCRE page