Idea
I like poetry. It is art and I love it. I like reading. It is writing and I love it. So are programmers not writers? Sort of.
Recently I read Your Code As A Crime Scene, and I am amazed at what you can do by analyzing source code. It has practical value in mapping a project's landscape as well as in being smarter about how code evolves in a project. Concepts such as coupling, churn and ownership are really compelling. As such, I am planning to implement a code-maat.el library so I can practice making a major mode in Emacs as well as get a grip on how to use code-maat and its tools.
This also motivated me to analyze Elisp code, both as a tribute and as an experiment. My fundamental question is: can you do complexity analysis in Elisp? Before you get excited, in all honesty I have no idea where I am going with this. Merely creating a parser, albeit a simple one, for Lisp is what drove me here. I'm good, but not that good.
Code Ideas
So here are the things I thought of when thinking about static Elisp analysis:
- Atom Frequency
- What atoms are in a file, and probably their frequency.
- Expression Complexity
- Is there a simple metric involving atoms and depth as a way to measure complexity?
- Code Manipulation
- Can you manipulate the source tree? A common question and concept.
With that in mind, I will need a parser and a project to analyze. Shameless plug: I will use my own Elisp parser project, elk.el, and analyze its source code with it. We programmers love meta programming and analysis.
Again, I have no idea what I'm doing, so just enjoy the flow of questions.
Parser
While I was making this parser, I learned how to use Cask and Travis CI for the project. What I did was separate the parsing module from the token analysis library, which I dub elk-magic.el (not another shameless plug).
Since this parser is just something to get me started analyzing source code, I probably did not get the terminology right. I know how to implement the parser using recursion; at this time, I really don't know what it is called or what I am doing. All I expect is a function that accepts source code and churns out a stream of tokens. Such is the functional programmer in me, but if I make mistakes in terminology or concept, I would be happy to be corrected. I really just want to see what is in the source code; I am not planning to create an interpreter.
Here is some terminology I came across:
- Tokens
- Everything in the source code one might be interested in. This is the main data type, expressed as a plist.
- Comment
- Lines starting with a semicolon, like `; Comment\n`
- Whitespace
- Everything between the tokens
- Text
- A text, or a string if you'd like to call it that: `"Hello World"`
- Atom
- A symbol or a number, basically what you put in `(atom atom atom)`
- Quotes
- The unary expressions in Elisp: `'quoting`, `` `back-quoting `` or `#'function-quote`
- Expressions
- The s-exps of the source code; recursion is prevalent here.
I am aware that numbers are not in the list, but I do not care about the actual data type for this analysis. One could put a mapper on the tokens just to analyze what data type an atom would be, but I am interested in the syntax tokens rather than what they really are, and I know that is more complicated than I need.
Honestly, you don't need to care about the problems I hit while implementing the parser, but for those interested:
- Character Escaping
- Handling `\`, `?` or `?\` made this project harder to implement, but I should have expected that (see the sketch after this list for a taste).
- Infinite Recursion
- Sometimes the code hangs when there is an error while recursively tokenizing the code. Unit testing and modularity would have saved me a lot of development time.
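These are the kinds of inputs that can trip a naive tokenizer. The cases below are my own illustrative picks, not ones from the elk.el test suite:

```emacs-lisp
;; My own illustrative worry cases, not from the elk.el test suite
(length "\\")  ; a backslash inside a string must not end the string early
?\)            ; a character literal for `)' that must not close an expression
?\"            ; a character literal for `"' that must not start a string
?\;            ; a character literal for `;' that must not start a comment
```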
There are probably more, but those two I should have considered from the start. Besides that, implementing a recursive parser is easy enough, though obviously not perfect. After creating the unit tests, I am confident enough that it parses source code 90% of the time.
As an experiment, I tried parsing `alert`, `helm-projectile`, `prodigy` and my own source code, and it doesn't hang and gets them right. So with this meaningless parser prelude, let's get into the meat of the topic: source code analysis. All experimental source code is in `elk-magic.el` with the tag `analyzing-elisp-post`.
Atom Frequency
Let's start with a simple list.
```emacs-lisp
(a b c)
```
The main parsing function, `elk-parse`, will yield the hypothetical list of tokens, or plists. The structure should be evident and reflect Lisp itself with some extra annotation.
```emacs-lisp
(cl-prettyprint (elk-parse "(a b c)")) ;; Too lazy to pretty it properly

((:type expression
  :tokens ((:type atom :tokens nil :start-pos 1 :end-pos 2 :index 0
            :level 2 :id 2 :parent-id 1 :text "a" :data-type symbol)
           (:type whitespace :tokens nil :start-pos 2 :end-pos 3 :index 1
            :level 2 :id 3 :parent-id 1 :text " ")
           (:type atom :tokens nil :start-pos 3 :end-pos 4 :index 2
            :level 2 :id 4 :parent-id 1 :text "b" :data-type symbol)
           (:type whitespace :tokens nil :start-pos 4 :end-pos 5 :index 3
            :level 2 :id 5 :parent-id 1 :text " ")
           (:type atom :tokens nil :start-pos 5 :end-pos 6 :index 4
            :level 2 :id 6 :parent-id 1 :text "c" :data-type symbol))
  :start-pos 0 :end-pos -1 :index 0 :level 1 :id 1 :parent-id 0))
```
Take your time guessing the structure of the token, because it takes too much time to explain. I do want to point your attention to the plists with type `atom`. Given this structure, how do we get all the atom tokens out of this tree? Do think about it while I present the first code from `elk-magic.el`.
```emacs-lisp
;; `-compose', `-partial' and friends come from dash and dash-functional
(require 'dash)
(require 'dash-functional)

(defun elk--select-type (type tokens)
  "Filter tokens by a specified type"
  (funcall
   (-compose
    (-partial #'-filter
              (lambda (token) (eq (plist-get token :type) type)))
    #'elk--flatten-tokens)
   tokens))

(defun elk--extract-atoms (tokens)
  "Get atoms in tokens"
  (funcall
   (-compose
    (-partial #'-map (-rpartial #'plist-get :text))
    (-partial #'elk--select-type 'atom))
   tokens))

(cl-prettyprint (elk--extract-atoms (elk-parse "(a b c)")))

("a" "b" "c")
```
Using the functional style of recursion and mapping, I traverse the tree and collect the source text, and the function does return a list that reflects the actual structure of the code. But this is not exciting. As promised, let's apply this to `elk.el`. Here is the output, grouped and sorted by frequency.
```emacs-lisp
(defconst elk-file-url
  "https://raw.githubusercontent.com/FrancisMurillo/elk.el/analyzing-elisp-post/elk.el")

(defconst elk-file "~/Downloads/elk.el")

(require 'f)

(cl-prettyprint
 (elk-magic-summarize-atoms
  (elk-parse (f-read-text elk-file))))

;; So here is the token frequency
(("stream" . 61) ("elk--use-stream" . 37) ("defun" . 34) ("token" . 32)
 ("tokens" . 29) ("letter" . 27) ("current-char" . 25) ("nil" . 24)
 ("text" . 24) ("this-char" . 23) ("setf" . 22) ("lambda" . 18)
 ("start-pos" . 18) ("recurser" . 18) ("current" . 17) ("sub-tokens" . 16)
 ("type" . 14) (":tokens" . 13) ("when" . 12) ("incremented-index" . 10)
 ("not" . 9) ("-copy" . 9) ("stop" . 8) ("elk--create-token" . 8)
 ("end-pos" . 7) (":type" . 7) ("letrec" . 7) ("new-token" . 7)
 ("marked-token" . 7) ("lexical-let" . 6) ("quote-text" . 6) ("expression" . 6)
 ("index" . 5) ("command" . 5) ("pcase" . 5) ("require" . 4)
 ("current-value" . 4) ("0" . 4) ("&optional" . 4) ("t" . 4)
 ("base-value" . 4) ("elk--stream-next-p" . 4) ("elk--quote-p" . 4)
 ("next-char" . 4) ("elk--dispatch-stream-consumers" . 4) ("value" . 4)
 ("-map" . 4) ("leveled-token" . 4) ("_" . 4) ("seed" . 4)
 ("indexed-token" . 4) ("elk--text-stream" . 3) ("current-text" . 3)
 ("text-length" . 3) ("increment" . 3) ("peek" . 3) (":end-pos" . 3)
 ("elk--whitespace-p" . 3) ("elk--text-quote-p" . 3) ("elk--text-escape-p" . 3)
 ("elk--letter-escape-p" . 3) ("elk--atom-letter-p" . 3) ("start-letter" . 3)
 ("expression-tokens" . 3) (":text" . 3) ("level" . 3) ("generator" . 3)
 ("texify" . 3) ("elk-current-tokens" . 3) ("elk" . 2)
 ("elk--stream-consumers" . 2) ("elk--consume-whitespace" . 2)
 ("elk--consume-comment" . 2) ("elk--consume-text" . 2)
 ("elk--consume-quote" . 2) ("elk--consume-atom" . 2)
 ("elk--consume-expression" . 2) ("base" . 2) ("incrementer" . 2)
 ("default" . 2) ("-1" . 2) ("elk--stream-stop-p" . 2) (":start-pos" . 2)
 ("elk--comment-p" . 2) ("elk--newline-p" . 2) ("elk--function-quote-p" . 2)
 ("elk--back-quote-p" . 2) ("elk--expression-start-p" . 2)
 ("elk--expression-close-p" . 2) ("whitespace" . 2) ("comment" . 2)
 ("base-token" . 2) (":quote-text" . 2) ("push" . 2) ("handler" . 2)
 ("elk--attach-source" . 2) ("elk--attach-level" . 2) ("1" . 2)
 ("elk--attach-token-id" . 2) ("incremental-sequence" . 2) ("start" . 2)
 ("parent-id" . 2) (":id" . 2) ("elk--attach-expression-index" . 2)
 ("elk--attach-atom-type" . 2) ("typer" . 2) ("number" . 2)
 ("elk--parsing" . 2) ("elk--parse" . 2) ("parsing" . 2) ("-compose" . 2)
 ("-partial" . 2) ("text-tokens" . 2) ("source-text" . 2) ("cl-lib" . 1)
 ("dash" . 1) ("dash-functional" . 1) ("s" . 1) ("defgroup" . 1)
 (":prefix" . 1) (":group" . 1) ("tools" . 1) (":link" . 1) ("url-link" . 1)
 (":tag" . 1) ("elk--started-stream" . 1) ("s-matches-p" . 1) ("unless" . 1)
 (":level" . 1) (":parent-id" . 1) ("-map-indexed" . 1) (":index" . 1)
 ("zerop" . 1) ("symbol" . 1) (":data-type" . 1) ("elk--codify" . 1)
 ("s-join" . 1) ("-flatten" . 1) ("elk-parse" . 1) ("region-active-p" . 1))
```
A lot of tokens indeed! So the question is: what does this tell us? First off, there is already a lot of junk, such as function parameters (`:level`, `&optional`), numbers (`0`, `1`) and other well-known functions (`defun`, `not`). That much is expected, but again, what does it mean for a token to be frequent or otherwise?
Honestly, I don't know. I do want to know what keywords best identify a source file, but I guess I am merely grasping at straws here. I could tighten up the filter for what counts as a meaningful atom, as sketched below, but I could still run into the same problem.
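Here is a minimal sketch of such a filter. The stop-word list is my own guess, `my--meaningful-atom-summary` is a hypothetical helper that is not part of elk-magic.el, and I am assuming `elk-magic-summarize-atoms` returns an alist of `(text . count)` pairs as in the output above:

```emacs-lisp
(require 'dash)

(defconst my--elisp-stop-atoms
  '("defun" "lambda" "let" "let*" "setf" "when" "unless" "if"
    "not" "nil" "t" "&optional" "&rest" "require")
  "Common Elisp forms that say little about a specific library.")

(defun my--meaningful-atom-summary (summary)
  "Drop stop words, numbers and keywords from SUMMARY, an alist of (TEXT . COUNT)."
  (-remove (lambda (pair)
             (let ((text (car pair)))
               (or (member text my--elisp-stop-atoms)      ; known junk
                   (string-match-p "\\`-?[0-9]+\\'" text)  ; numeric literals
                   (string-prefix-p ":" text))))           ; plist keywords
           summary))

;; (my--meaningful-atom-summary
;;  (elk-magic-summarize-atoms (elk-parse (f-read-text elk-file))))
```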
An unoriginal idea would be to create a linter for what keywords or names should be allowed, such as enforcing a naming schema, but I don't want to go there.
So the first idea is a bust.
Expression Complexity
This is another wild idea, but the gist is: given an s-exp, is there a metric to compute its complexity? I am no computer science professor, but a simple metric can go like this; remember that atoms have a level property.
```emacs-lisp
(defun elk-magic--token-depth (token)
  "Find out the TOKEN depth or the maximum number of level it has."
  (letrec ((recurser
            (lambda (token)
              (let* ((token-level (plist-get token :level))
                     (sub-tokens (plist-get token :tokens))
                     (sub-token-depths (-map recurser sub-tokens)))
                (if sub-token-depths
                    (-max sub-token-depths)
                  token-level)))))
    (funcall recurser token)))

(defun elk-magic--token-atoms (token)
  "Find out the TOKEN child atoms up to the last depth."
  (letrec ((recurser
            (lambda (token)
              (let* ((token-type (plist-get token :type))
                     (sub-tokens (plist-get token :tokens)))
                (if (eq token-type 'atom)
                    (list token)
                  (apply #'append (-map recurser sub-tokens)))))))
    (funcall recurser token)))

(defun elk-magic--token-complexity (token)
  "Compute expression or TOKEN complexity."
  (let* ((atoms (elk-magic--token-atoms token))
         (root-level (plist-get token :level))
         ;; `/' with a single argument yields the reciprocal,
         ;; so this is 1 divided by the atom count
         (atom-complexity (/ (float (length atoms)))))
    (-sum
     (-map (lambda (atom)
             (let* ((depth (- (plist-get atom :level) root-level))
                    (depth-complexity depth))
               (* atom-complexity depth-complexity)))
           atoms))))
```
In short, it is the sum over all the atoms of `(1 / number-of-atoms) * (atom-level - root-expression-level)`, which works out to the average nesting depth of the atoms. The idea is weighted atom levels: by giving each atom a weight based on the total number of atoms and factoring in how deeply nested that atom is, it should be a good guess at complexity. I guess. So for a quick gist, let's apply it to the snippet above.
```emacs-lisp
;; Code from above
(defconst snippet-code
  (buffer-substring-no-properties (region-beginning) (region-end)))

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse snippet-code)))

;; 0 are the whitespace, whitespace has no complexity or is there?
(0
 6.357142857142854  ; elk-magic--token-depth
 0
 6.599999999999997  ; elk-magic--token-atoms
 0
 5.727272727272726  ; elk-magic--token-complexity
 )
```
Again, what does a 5 or a 6 tell us? We need more context.
```emacs-lisp
;; A normal list
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "(1 2 3 4)")))
; Result (1.0)

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "'a-regular-atom")))
; Result (1.0) ; Baseline

;; A require
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "(require 'elk)")))
; Result (1.5)

;; A normal function
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse "(defun hello-elk () (interactive) (message \"Hello Elk\"))")))
; Result (1.5)

;; A battle of recursion
(require 'subr-x)

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse (string-trim-left "
(defun factorial-linear (n)
  (interactive)
  (let ((value 1)
        (counter 1))
    (while (<= counter n)
      (setf value (* value counter))
      (setf counter (1+ counter)))))"))))
; Result (3.6363636363636376) ; Imperative is not that complicated

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse (string-trim-left "
(defun factorial-recursive (n)
  (interactive)
  (if (zerop n)
      1
    (* n (factorial-recursive (1- n)))))"))))
; Result (2.769230769230769) ; Functional is less complicated!?

;; Very nested
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse (string-trim-left "
(1 (2 (3 (4 (5 (6 (7 (8 (9 (10 (11 (12 (13 (14 (15 (16))))))))))))))))"))))
; Result (8.5) ; Complicated indeed
```
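To see where these numbers come from, here is the metric worked out by hand on a tiny nested list of my own; the expected result follows from the definition above:

```emacs-lisp
;; (1 (2 (3))) has three atoms at depths 1, 2 and 3 below the root, so the
;; metric is (1/3 * 1) + (1/3 * 2) + (1/3 * 3) = 2.0: the average atom depth.
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "(1 (2 (3)))")))
; Expected result: (2.0)
```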
So the results are more encouraging than the previous idea, but again, what does this tell us? I still don't know, but here is what I think, with a small helper after the list that encodes these buckets:
- If it is around 1 to 5, the code is considered simple
- If it is around 6 to 8, the code is non-trivial
- Anything higher than 8, the code is complex
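Such a helper could look like this; `my--complexity-label` is my own hypothetical name, not part of elk-magic.el:

```emacs-lisp
(defun my--complexity-label (score)
  "Map a complexity SCORE to the rough buckets above."
  (cond ((<= score 5) 'simple)
        ((<= score 8) 'non-trivial)
        (t 'complex)))

;; (my--complexity-label 8.5) => complex
```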
The cases may vary, and no one can really say whether code should be complex or simple; that comes with experience. But having a simple metric is kinda nice. I do want to say that this metric has its flaws and can be gamed, but again, it is nice. One can compute the average complexity of a file and then track it over time, or simply get a static complexity landscape (a sketch of that follows). Or one can just eyeball the code and say it needs refactoring. Nothing beats experience, I think.
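As a sketch of the tracking idea, averaging the metric over a file's top-level expressions might look like this. `my--file-average-complexity` is my own hypothetical helper built on the functions above:

```emacs-lisp
(require 'dash)
(require 'f)

(defun my--file-average-complexity (file)
  "Average `elk-magic--token-complexity' over FILE's top-level expressions."
  (let* ((tokens (elk-parse (f-read-text file)))
         ;; keep only real expressions, skipping whitespace and comments
         (expressions (-filter (lambda (token)
                                 (eq (plist-get token :type) 'expression))
                               tokens))
         (scores (-map #'elk-magic--token-complexity expressions)))
    (if scores
        (/ (-sum scores) (float (length scores)))
      0.0)))

;; (my--file-average-complexity "~/Downloads/elk.el")
```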
Quite encouraging for a simple experiment. I would also say that this might beat whitespace analysis when people prefer condensed code. I don't know, but for now this is good.
Code Manipulation
If you have the syntax tree, then you can manipulate it, right? This is the common use of a syntax tree; it could also drive a formatter, but Elisp already has one. I can give two trivial examples, which kinda makes implementing this in a more Emacs way fun:
- Nearest top level expression at point
- Manipulating the code
The first one is very easy.
```emacs-lisp
(defun elk-magic--nearest-top-expression-at-point ()
  "Get token expression that is nearest to the highest point"
  (interactive)
  (let* ((source-text (buffer-substring-no-properties (point-min) (point-max)))
         (tokens (elk-parse source-text))
         (expression-token
          (-first (lambda (token)
                    (and (= (plist-get token :level) 1)
                         (eq (plist-get token :type) 'expression)
                         (<= (plist-get token :start-pos) (point))
                         (>= (plist-get token :end-pos) (point))))
                  tokens)))
    (if expression-token
        (goto-char (1+ (plist-get expression-token :start-pos)))
      (message "No near top level expression at point"))))
; Complexity: 6.377777777777779 (Non-trivial)
```
Simply parse the code, filter the top-level expressions for the one containing the point, and move the point to its `start-pos`. The sad thing is that if the code is large, this has to parse the whole file just to figure out where to land. There is an easier way with `paredit`.
```emacs-lisp
(defun paredit--nearest-top-level-expression-at-point ()
  "Above but using paredit"
  (interactive)
  (condition-case ex
      (while t
        (paredit-backward-up))
    (error nil)))
; Complexity: 2.5 (Simple)
```
Even the complexity metric says so. Wow… it just goes to show Emacs rocks. You don't need the syntax tree just to move around the code. But how about something juicier, though not really useful?
Let's say we have this snippet.
```emacs-lisp
(defun x ()
  nil)

(defvar y nil)
```
Nothing special, right? Yeah. For the sake of example, how about we want to namespace it… say with `z`? Let's use the syntax tree. This will be way harder than just doing a macro replace; I am not selling myself that much, huh?
```emacs-lisp
(defun namespacer (namespace)
  (interactive)
  (let* ((source-code (buffer-substring-no-properties (point-min) (point-max)))
         (raw-tokens (elk-parse source-code))
         (token-table (elk-magic--create-token-table raw-tokens))
         (just-tokens (elk-magic--discard-filler raw-tokens))
         (interface-expression-p
          (lambda (expression)
            (let* ((sub-tokens (plist-get expression :tokens))
                   (header-atom (nth 0 sub-tokens))
                   (name-atom (nth 1 sub-tokens))
                   (header-text (plist-get header-atom :text))
                   (name-text (plist-get name-atom :text)))
              (or (string-equal header-text "defun")
                  (string-equal header-text "defvar")))))
         (interface-expressions
          (-filter (lambda (token)
                     (and (eq (plist-get token :type) 'expression)
                          (= (plist-get token :level) 1)
                          (funcall interface-expression-p token)))
                   just-tokens))
         (interace-atoms
          (-map (lambda (token)
                  (let* ((interface-atom (nth 1 (plist-get token :tokens)))
                         (interface-text (plist-get interface-atom :text))
                         (new-interface-text (concat namespace interface-text)))
                    (plist-put interface-atom :text new-interface-text)))
                interface-expressions)))
    (elk--codify raw-tokens)))
; Complexity: 7.2065217391304355 (Non-trivial?)
```
The code is indeed complicated, but it works. Applying it to this very code with `(namespacer "my-namespace-")` would prefix `namespacer` itself, renaming it to `my-namespace-namespacer`. I could paste the resulting code, but it would be merely a one line change.
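To make the effect concrete, here is roughly what the transformation does to the earlier two-liner. The "after" output is written out by hand, assuming `elk--codify` re-renders the modified token text verbatim:

```emacs-lisp
;; Before
(defun x () nil)
(defvar y nil)

;; After running (namespacer "z-") over the buffer (hand-written expectation)
(defun z-x () nil)
(defvar z-y nil)
```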
The sad thing is that you can do this with a macro and probably be safer, but it shows how you can manipulate the code using the tokens. This is more of a proof of concept than anything else. There is too much boilerplate just to change a function name; there might be a better paradigm. Esprima, anyone?
Conclusion
So nothing much to be impressed by… for now. I haven't finished the book yet, but now I have a tool for applying some analysis to Elisp. I'm still looking for ideas about analyzing code at face value, not by context. Code analysis has been done plenty; what I am looking at is how code can be analyzed as a paragraph or as a poem. Automatic summarization and language processing: is there an analogy for code? We have code generation and other things, but how about tools that tell you about the structure, or give you an abstract?
Maybe I am talking in riddles, but I can say this is a fun project, although with little returns. Hmm… I wonder what else I can analyze about the tokens or their proximity.
I am a writer. I am a coder. I am a writer and a coder.