Idea
I like poetry. It is art and I love it. I like reading. It is writing and I love it. So are programmers not writers? Sort of.
Recently I read Your Code As A Crime Scene, and I am amazed at what you can do by analyzing source code. It has practical value in mapping a project's landscape as well as in being smarter about how code evolves in a project. Concepts such as coupling, churn and ownership are really compelling. As such, I am planning to implement a code-maat.el library so I can practice making a major mode in Emacs as well as get a grip on how to use code-maat and its tools.
This also motivated me to analyze Elisp code, both as a tribute and as an experiment. My fundamental question is: can you do complexity analysis in Elisp? Before you get excited, in all honesty I have no idea where I am going with this. Merely creating a parser, albeit a simple one, for Lisp is what drove me here. I'm good, but not that good.
Code Ideas
So here are the things I thought of when thinking about static Elisp analysis:
- Atom Frequency
- What atoms are in a file, and probably their frequency.
- Expression Complexity
- Is there a simple metric involving atoms and depth as a way to measure complexity?
- Code Manipulation
- Can you manipulate the source tree? A common question and concept.
With that in mind, I will need a parser and a project to analyze. Shameless plug: I will use my own Elisp parser project, elk.el, and analyze its source code with it. We programmers love meta programming and analysis.
Again, I have no idea what I'm doing, so just enjoy the flow of questions.
Parser
While I was making this parser, I learned how to use Cask and Travis CI for the project. What I did was separate the parsing module from the token analysis library, which I dub elk-magic.el (not another shameless plug).
Since this parser is just something to get me started analyzing source code, I probably did not get the terminology right. I know how to implement the parser using recursion; at this time, I really don't know what it is called or what I am doing. All I expect is a function that accepts source code and churns out a stream of tokens. Such is the functional programmer in me, but if I make mistakes in terminology or concept, I would be happy to be corrected. I really just want to see what is in the source code; I am not planning to create an interpreter.
Here is some terminology I came across:
- Tokens
- Everything in the source code one might be interested in. This is the main data type, expressed as a plist.
- Comment
- Lines starting with a semicolon, like `; Comment\n`
- Whitespace
- Everything between the tokens
- Text
- A text, or a string if you'd like to call it that: `"Hello World"`
- Atom
- A symbol or a number, basically what you put in `(atom atom atom)`
- Quotes
- The unary expressions in Elisp: `'quoting`, `` `back-quoting `` or `#'function-quote`
- Expressions
- The s-exps of the source code; recursion is prevalent here.
I am aware that numbers are not in the list, but I do not care about the actual data type for this analysis. One could put a mapper on the tokens just to analyze what data type an atom would be, but I am interested in the syntax tokens rather than what they really are, and I know that is more complicated than I need.
Honestly, you don't need to care about the problems I hit while implementing the parser, but for those interested:
- Character Escaping
- Handling `\`, `?` or `?\` made this project harder to implement, but I should have expected that (see the sketch after this list for a taste).
- Infinite Recursion
- Sometimes the code hangs when there is an error while recursively tokenizing the code. Unit testing and modularity would have saved me a lot of development time.
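These are the kinds of inputs that can trip a naive tokenizer. The cases below are my own illustrative picks, not ones from the elk.el test suite:

```emacs-lisp
;; My own illustrative worry cases, not from the elk.el test suite
(length "\\")  ; a backslash inside a string must not end the string early
?\)            ; a character literal for `)' that must not close an expression
?\"            ; a character literal for `"' that must not start a string
?\;            ; a character literal for `;' that must not start a comment
```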
There are probably more, but those two I should have considered from the start. Besides that, implementing a recursive parser is easy enough, though obviously not perfect. After creating the unit tests, I am confident enough that it parses source code 90% of the time.
As an experiment, I tried parsing `alert`, `helm-projectile`, `prodigy` and my own source code, and it doesn't hang and gets them right. So with this meaningless parser prelude, let's get into the meat of the topic: source code analysis. All experimental source code is in `elk-magic.el` with the tag `analyzing-elisp-post`.
Atom Frequency
Let's start with a simple list.
```emacs-lisp
(a b c)
```
The main parsing function, `elk-parse`, will yield the hypothetical list of tokens, or plists. The structure should be evident and reflect Lisp itself with some extra annotation.
```emacs-lisp
(cl-prettyprint (elk-parse "(a b c)")) ;; Too lazy to pretty it properly

((:type expression
  :tokens ((:type atom :tokens nil :start-pos 1 :end-pos 2 :index 0
            :level 2 :id 2 :parent-id 1 :text "a" :data-type symbol)
           (:type whitespace :tokens nil :start-pos 2 :end-pos 3 :index 1
            :level 2 :id 3 :parent-id 1 :text " ")
           (:type atom :tokens nil :start-pos 3 :end-pos 4 :index 2
            :level 2 :id 4 :parent-id 1 :text "b" :data-type symbol)
           (:type whitespace :tokens nil :start-pos 4 :end-pos 5 :index 3
            :level 2 :id 5 :parent-id 1 :text " ")
           (:type atom :tokens nil :start-pos 5 :end-pos 6 :index 4
            :level 2 :id 6 :parent-id 1 :text "c" :data-type symbol))
  :start-pos 0 :end-pos -1 :index 0 :level 1 :id 1 :parent-id 0))
```
Take your time guessing the structure of the token, because it takes too much time to explain. I do want to point your attention to the plists with type `atom`. Given this structure, how do we get all the atom tokens out of this tree? Do think about it while I present the first code from `elk-magic.el`.
```emacs-lisp
;; `-compose', `-partial' and friends come from dash and dash-functional
(require 'dash)
(require 'dash-functional)

(defun elk--select-type (type tokens)
  "Filter tokens by a specified type"
  (funcall
   (-compose
    (-partial #'-filter
              (lambda (token) (eq (plist-get token :type) type)))
    #'elk--flatten-tokens)
   tokens))

(defun elk--extract-atoms (tokens)
  "Get atoms in tokens"
  (funcall
   (-compose
    (-partial #'-map (-rpartial #'plist-get :text))
    (-partial #'elk--select-type 'atom))
   tokens))

(cl-prettyprint (elk--extract-atoms (elk-parse "(a b c)")))

("a" "b" "c")
```
Using the functional style of recursion and mapping, I traverse the tree and collect the source text, and the function does return a list that reflects the actual structure of the code. But this is not exciting. As promised, let's apply this to `elk.el`. Here is the output, grouped and sorted by frequency.
```emacs-lisp
(defconst elk-file-url
  "https://raw.githubusercontent.com/FrancisMurillo/elk.el/analyzing-elisp-post/elk.el")

(defconst elk-file "~/Downloads/elk.el")

(require 'f)

(cl-prettyprint
 (elk-magic-summarize-atoms
  (elk-parse (f-read-text elk-file))))

;; So here is the token frequency
(("stream" . 61) ("elk--use-stream" . 37) ("defun" . 34) ("token" . 32)
 ("tokens" . 29) ("letter" . 27) ("current-char" . 25) ("nil" . 24)
 ("text" . 24) ("this-char" . 23) ("setf" . 22) ("lambda" . 18)
 ("start-pos" . 18) ("recurser" . 18) ("current" . 17) ("sub-tokens" . 16)
 ("type" . 14) (":tokens" . 13) ("when" . 12) ("incremented-index" . 10)
 ("not" . 9) ("-copy" . 9) ("stop" . 8) ("elk--create-token" . 8)
 ("end-pos" . 7) (":type" . 7) ("letrec" . 7) ("new-token" . 7)
 ("marked-token" . 7) ("lexical-let" . 6) ("quote-text" . 6) ("expression" . 6)
 ("index" . 5) ("command" . 5) ("pcase" . 5) ("require" . 4)
 ("current-value" . 4) ("0" . 4) ("&optional" . 4) ("t" . 4)
 ("base-value" . 4) ("elk--stream-next-p" . 4) ("elk--quote-p" . 4)
 ("next-char" . 4) ("elk--dispatch-stream-consumers" . 4) ("value" . 4)
 ("-map" . 4) ("leveled-token" . 4) ("_" . 4) ("seed" . 4)
 ("indexed-token" . 4) ("elk--text-stream" . 3) ("current-text" . 3)
 ("text-length" . 3) ("increment" . 3) ("peek" . 3) (":end-pos" . 3)
 ("elk--whitespace-p" . 3) ("elk--text-quote-p" . 3) ("elk--text-escape-p" . 3)
 ("elk--letter-escape-p" . 3) ("elk--atom-letter-p" . 3) ("start-letter" . 3)
 ("expression-tokens" . 3) (":text" . 3) ("level" . 3) ("generator" . 3)
 ("texify" . 3) ("elk-current-tokens" . 3) ("elk" . 2)
 ("elk--stream-consumers" . 2) ("elk--consume-whitespace" . 2)
 ("elk--consume-comment" . 2) ("elk--consume-text" . 2)
 ("elk--consume-quote" . 2) ("elk--consume-atom" . 2)
 ("elk--consume-expression" . 2) ("base" . 2) ("incrementer" . 2)
 ("default" . 2) ("-1" . 2) ("elk--stream-stop-p" . 2) (":start-pos" . 2)
 ("elk--comment-p" . 2) ("elk--newline-p" . 2) ("elk--function-quote-p" . 2)
 ("elk--back-quote-p" . 2) ("elk--expression-start-p" . 2)
 ("elk--expression-close-p" . 2) ("whitespace" . 2) ("comment" . 2)
 ("base-token" . 2) (":quote-text" . 2) ("push" . 2) ("handler" . 2)
 ("elk--attach-source" . 2) ("elk--attach-level" . 2) ("1" . 2)
 ("elk--attach-token-id" . 2) ("incremental-sequence" . 2) ("start" . 2)
 ("parent-id" . 2) (":id" . 2) ("elk--attach-expression-index" . 2)
 ("elk--attach-atom-type" . 2) ("typer" . 2) ("number" . 2)
 ("elk--parsing" . 2) ("elk--parse" . 2) ("parsing" . 2) ("-compose" . 2)
 ("-partial" . 2) ("text-tokens" . 2) ("source-text" . 2) ("cl-lib" . 1)
 ("dash" . 1) ("dash-functional" . 1) ("s" . 1) ("defgroup" . 1)
 (":prefix" . 1) (":group" . 1) ("tools" . 1) (":link" . 1) ("url-link" . 1)
 (":tag" . 1) ("elk--started-stream" . 1) ("s-matches-p" . 1) ("unless" . 1)
 (":level" . 1) (":parent-id" . 1) ("-map-indexed" . 1) (":index" . 1)
 ("zerop" . 1) ("symbol" . 1) (":data-type" . 1) ("elk--codify" . 1)
 ("s-join" . 1) ("-flatten" . 1) ("elk-parse" . 1) ("region-active-p" . 1))
```
A lot of tokens indeed! So the question is: what does this tell us? First off, there is already a lot of junk, such as function parameters (`:level`, `&optional`), numbers (`0`, `1`) and other well-known functions (`defun`, `not`). That much is expected, but again, what does it mean for a token to be frequent or otherwise?
Honestly, I don't know. I do want to know what keywords best identify a source file, but I guess I am merely grasping at straws here. I could tighten up the filter for what counts as a meaningful atom, as sketched below, but I could still run into the same problem.
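Here is a minimal sketch of such a filter. The stop-word list is my own guess, `my--meaningful-atom-summary` is a hypothetical helper that is not part of elk-magic.el, and I am assuming `elk-magic-summarize-atoms` returns an alist of `(text . count)` pairs as in the output above:

```emacs-lisp
(require 'dash)

(defconst my--elisp-stop-atoms
  '("defun" "lambda" "let" "let*" "setf" "when" "unless" "if"
    "not" "nil" "t" "&optional" "&rest" "require")
  "Common Elisp forms that say little about a specific library.")

(defun my--meaningful-atom-summary (summary)
  "Drop stop words, numbers and keywords from SUMMARY, an alist of (TEXT . COUNT)."
  (-remove (lambda (pair)
             (let ((text (car pair)))
               (or (member text my--elisp-stop-atoms)      ; known junk
                   (string-match-p "\\`-?[0-9]+\\'" text)  ; numeric literals
                   (string-prefix-p ":" text))))           ; plist keywords
           summary))

;; (my--meaningful-atom-summary
;;  (elk-magic-summarize-atoms (elk-parse (f-read-text elk-file))))
```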
An unoriginal idea would be to create a linter for what keywords or names should be allowed, such as enforcing a naming schema, but I don't want to go there.
So the first idea is a bust.
Expression Complexity
This is another wild idea, but the gist is: given an s-exp, is there a metric to compute its complexity? I am no computer science professor, but a simple metric can go like this; remember that atoms have a level property.
```emacs-lisp
(defun elk-magic--token-depth (token)
  "Find out the TOKEN depth or the maximum number of level it has."
  (letrec ((recurser
            (lambda (token)
              (let* ((token-level (plist-get token :level))
                     (sub-tokens (plist-get token :tokens))
                     (sub-token-depths (-map recurser sub-tokens)))
                (if sub-token-depths
                    (-max sub-token-depths)
                  token-level)))))
    (funcall recurser token)))

(defun elk-magic--token-atoms (token)
  "Find out the TOKEN child atoms up to the last depth."
  (letrec ((recurser
            (lambda (token)
              (let* ((token-type (plist-get token :type))
                     (sub-tokens (plist-get token :tokens)))
                (if (eq token-type 'atom)
                    (list token)
                  (apply #'append (-map recurser sub-tokens)))))))
    (funcall recurser token)))

(defun elk-magic--token-complexity (token)
  "Compute expression or TOKEN complexity."
  (let* ((atoms (elk-magic--token-atoms token))
         (root-level (plist-get token :level))
         ;; `/' with a single argument yields the reciprocal,
         ;; so this is 1 divided by the atom count
         (atom-complexity (/ (float (length atoms)))))
    (-sum
     (-map (lambda (atom)
             (let* ((depth (- (plist-get atom :level) root-level))
                    (depth-complexity depth))
               (* atom-complexity depth-complexity)))
           atoms))))
```
In short, it is the sum over all the atoms of `(1 / number-of-atoms) * (atom-level - root-expression-level)`, which works out to the average nesting depth of the atoms. The idea is weighted atom levels: by giving each atom a weight based on the total number of atoms and factoring in how deeply nested that atom is, it should be a good guess at complexity. I guess. So for a quick gist, let's apply it to the snippet above.
```emacs-lisp
;; Code from above
(defconst snippet-code
  (buffer-substring-no-properties (region-beginning) (region-end)))

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse snippet-code)))

;; 0 are the whitespace, whitespace has no complexity or is there?
(0
 6.357142857142854  ; elk-magic--token-depth
 0
 6.599999999999997  ; elk-magic--token-atoms
 0
 5.727272727272726  ; elk-magic--token-complexity
 )
```
Again, what does a 5 or a 6 tell us? We need more context.
```emacs-lisp
;; A normal list
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "(1 2 3 4)")))
; Result (1.0)

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "'a-regular-atom")))
; Result (1.0) ; Baseline

;; A require
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "(require 'elk)")))
; Result (1.5)

;; A normal function
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse "(defun hello-elk () (interactive) (message \"Hello Elk\"))")))
; Result (1.5)

;; A battle of recursion
(require 'subr-x)

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse (string-trim-left "
(defun factorial-linear (n)
  (interactive)
  (let ((value 1)
        (counter 1))
    (while (<= counter n)
      (setf value (* value counter))
      (setf counter (1+ counter)))))"))))
; Result (3.6363636363636376) ; Imperative is not that complicated

(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse (string-trim-left "
(defun factorial-recursive (n)
  (interactive)
  (if (zerop n)
      1
    (* n (factorial-recursive (1- n)))))"))))
; Result (2.769230769230769) ; Functional is less complicated!?

;; Very nested
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity
         (elk-parse (string-trim-left "
(1 (2 (3 (4 (5 (6 (7 (8 (9 (10 (11 (12 (13 (14 (15 (16))))))))))))))))"))))
; Result (8.5) ; Complicated indeed
```
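To see where these numbers come from, here is the metric worked out by hand on a tiny nested list of my own; the expected result follows from the definition above:

```emacs-lisp
;; (1 (2 (3))) has three atoms at depths 1, 2 and 3 below the root, so the
;; metric is (1/3 * 1) + (1/3 * 2) + (1/3 * 3) = 2.0: the average atom depth.
(cl-prettyprint
 (mapcar #'elk-magic--token-complexity (elk-parse "(1 (2 (3)))")))
; Expected result: (2.0)
```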
So the results are more encouraging than the previous idea, but again, what does this tell us? I still don't know, but here is what I think, with a small helper after the list that encodes these buckets:
- If it is around 1 to 5, the code is considered simple
- If it is around 6 to 8, the code is non-trivial
- Anything higher than 8, the code is complex
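Such a helper could look like this; `my--complexity-label` is my own hypothetical name, not part of elk-magic.el:

```emacs-lisp
(defun my--complexity-label (score)
  "Map a complexity SCORE to the rough buckets above."
  (cond ((<= score 5) 'simple)
        ((<= score 8) 'non-trivial)
        (t 'complex)))

;; (my--complexity-label 8.5) => complex
```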
The cases may vary, and no one can really say whether code should be complex or simple; that comes with experience. But having a simple metric is kinda nice. I do want to say that this metric has its flaws and can be gamed, but again, it is nice. One can compute the average complexity of a file and then track it over time, or simply get a static complexity landscape (a sketch of that follows). Or one can just eyeball the code and say it needs refactoring. Nothing beats experience, I think.
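As a sketch of the tracking idea, averaging the metric over a file's top-level expressions might look like this. `my--file-average-complexity` is my own hypothetical helper built on the functions above:

```emacs-lisp
(require 'dash)
(require 'f)

(defun my--file-average-complexity (file)
  "Average `elk-magic--token-complexity' over FILE's top-level expressions."
  (let* ((tokens (elk-parse (f-read-text file)))
         ;; keep only real expressions, skipping whitespace and comments
         (expressions (-filter (lambda (token)
                                 (eq (plist-get token :type) 'expression))
                               tokens))
         (scores (-map #'elk-magic--token-complexity expressions)))
    (if scores
        (/ (-sum scores) (float (length scores)))
      0.0)))

;; (my--file-average-complexity "~/Downloads/elk.el")
```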
Quite encouraging for a simple experiment. I would also say that this might beat whitespace analysis when people prefer condensed code. I don't know, but for now this is good.
Code Manipulation
If you have the syntax tree, then you can manipulate it, right? This is the common use of a syntax tree; it could also drive a formatter, but Elisp already has one. I can give two trivial examples, which kinda makes implementing this in a more Emacs way fun:
- Nearest top level expression at point
- Manipulating the code
The first one is very easy.
```emacs-lisp
(defun elk-magic--nearest-top-expression-at-point ()
  "Get token expression that is nearest to the highest point"
  (interactive)
  (let* ((source-text (buffer-substring-no-properties (point-min) (point-max)))
         (tokens (elk-parse source-text))
         (expression-token
          (-first (lambda (token)
                    (and (= (plist-get token :level) 1)
                         (eq (plist-get token :type) 'expression)
                         (<= (plist-get token :start-pos) (point))
                         (>= (plist-get token :end-pos) (point))))
                  tokens)))
    (if expression-token
        (goto-char (1+ (plist-get expression-token :start-pos)))
      (message "No near top level expression at point"))))
; Complexity: 6.377777777777779 (Non-trivial)
```
Simply parse the code, filter the top-level expressions for the one containing the point, and move the point to its `start-pos`. The sad thing is that if the code is large, this has to parse the whole file just to figure out where to land. There is an easier way with `paredit`.
```emacs-lisp
(defun paredit--nearest-top-level-expression-at-point ()
  "Above but using paredit"
  (interactive)
  (condition-case ex
      (while t
        (paredit-backward-up))
    (error nil)))
; Complexity: 2.5 (Simple)
```
Even the complexity metric says so. Wow… it just goes to show Emacs rocks. You don't need the syntax tree just to move around the code. But how about something juicier, though not really useful?
Let's say we have this snippet.
```emacs-lisp
(defun x ()
  nil)

(defvar y nil)
```
Nothing special, right? Yeah. For the sake of example, how about we want to namespace it… say with `z`? Let's use the syntax tree. This will be way harder than just doing a macro replace; I am not selling myself that much, huh?
```emacs-lisp
(defun namespacer (namespace)
  (interactive)
  (let* ((source-code (buffer-substring-no-properties (point-min) (point-max)))
         (raw-tokens (elk-parse source-code))
         (token-table (elk-magic--create-token-table raw-tokens))
         (just-tokens (elk-magic--discard-filler raw-tokens))
         (interface-expression-p
          (lambda (expression)
            (let* ((sub-tokens (plist-get expression :tokens))
                   (header-atom (nth 0 sub-tokens))
                   (name-atom (nth 1 sub-tokens))
                   (header-text (plist-get header-atom :text))
                   (name-text (plist-get name-atom :text)))
              (or (string-equal header-text "defun")
                  (string-equal header-text "defvar")))))
         (interface-expressions
          (-filter (lambda (token)
                     (and (eq (plist-get token :type) 'expression)
                          (= (plist-get token :level) 1)
                          (funcall interface-expression-p token)))
                   just-tokens))
         (interace-atoms
          (-map (lambda (token)
                  (let* ((interface-atom (nth 1 (plist-get token :tokens)))
                         (interface-text (plist-get interface-atom :text))
                         (new-interface-text (concat namespace interface-text)))
                    (plist-put interface-atom :text new-interface-text)))
                interface-expressions)))
    (elk--codify raw-tokens)))
; Complexity: 7.2065217391304355 (Non-trivial?)
```
The code is indeed complicated, but it works. Applying it to this very code with `(namespacer "my-namespace-")` would prefix `namespacer` itself, renaming it to `my-namespace-namespacer`. I could paste the resulting code, but it would be merely a one line change.
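To make the effect concrete, here is roughly what the transformation does to the earlier two-liner. The "after" output is written out by hand, assuming `elk--codify` re-renders the modified token text verbatim:

```emacs-lisp
;; Before
(defun x () nil)
(defvar y nil)

;; After running (namespacer "z-") over the buffer (hand-written expectation)
(defun z-x () nil)
(defvar z-y nil)
```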
The sad thing is that you can do this with a macro and probably be safer, but it shows how you can manipulate the code using the tokens. This is more of a proof of concept than anything else. There is too much boilerplate just to change a function name; there might be a better paradigm. Esprima, anyone?
Conclusion
So nothing much to be impressed by… for now. I haven't finished the book yet, but now I have a tool for applying some analysis to Elisp. I'm still looking for ideas about analyzing code at face value, not by context. Code analysis has been done plenty; what I am looking at is how code can be analyzed as a paragraph or as a poem. Automatic summarization and language processing: is there an analogy for code? We have code generation and other things, but how about tools that tell you about the structure, or give you an abstract?
Maybe I am talking in riddles, but I can say this is a fun project, although with little returns. Hmm… I wonder what else I can analyze about the tokens or their proximity.
I am a writer. I am a coder. I am a writer and a coder.