Embedding Graphviz Graphics in HTML Pages

August 25, 2012

You can also read this post on GitHub.

Motivation

Recently, the team I work with has been focusing on documenting a
complex online advertising system we developed. Design docs usually
contain figures such as system architecture and data flow diagrams.
As programmers, we don’t want to draw if we can code, so it would be
great to embed Graphviz dot graphics in our HTML documents.

This article describes a solution composed of an Emacs-based editing
environment, an Ajax client, and a Racket server that invokes Graphviz
to render the graphics. The source code is on
GitHub.

The following illustration shows the HTML source and its rendered
result:
[Image: embedded-graphviz.png]

Editing

We write our documents in HTML, so we want to embed the Graphviz source
code in a <pre> tag. However, most HTML editors highlight syntax
according to HTML rules and thus cannot highlight the embedded Graphviz
source code. An easy solution I found is to use Emacs (apologies to
those who do not use Emacs), as it is highly customizable.

The solution is to use multi-web-mode.el, a minor mode that selects
the appropriate major mode automatically according to where your point
is. For example, the following configuration says that if your
point is in a region that begins with a match of the regular expression
"<pre +type=\"text/graphviz\"[^>]*>" and ends with "</pre>",
graphviz-dot-mode will be used.

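;; Multi Web mode
;; Load graphviz-dot-mode so it is available for embedded Graphviz regions.
(load-file "~/.emacs.d/graphviz-dot-mode.el")
;; Make multi-web-mode.el in ~/.emacs.d/ loadable.
(add-to-list 'load-path "~/.emacs.d/")
(require 'multi-web-mode)
;; Use html-mode outside of the tagged regions listed below.
(setq mweb-default-major-mode 'html-mode)
;; Each entry is (major-mode opening-regexp closing-regexp).
(setq mweb-tags '((php-mode "<\\?php\\|<\\? \\|<\\?=" "\\?>")
                  (js-mode "<script +\\(type=\"text/javascript\"\\|language=\"javascript\"\\)[^>]*>" "</script>")
                  (css-mode "<style +type=\"text/css\"[^>]*>" "</style>")
                  (graphviz-dot-mode "<pre +type=\"text/graphviz\"[^>]*>" "</pre>")))
;; Enable multi-web-mode for files with these extensions.
(setq mweb-filename-extensions '("php" "htm" "html" "ctp" "phtml" "php4" "php5"))
(multi-web-global-mode 1)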

You can copy and paste the above configuration into your ~/.emacs file. It
assumes that you have downloaded
graphviz-dot-mode.el
and multi-web-mode.el,
and put them into your ~/.emacs.d directory.
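
To check which opening tags the graphviz-dot-mode pattern would match, here is a small JavaScript sketch. It assumes the Emacs regular expression translates directly to the JavaScript pattern below (the constructs used, " +" and "[^>]*", behave the same in both). The function name startsGraphvizRegion and the sample tags are made up for illustration.

```javascript
// JavaScript translation of the Emacs pattern "<pre +type=\"text/graphviz\"[^>]*>".
// Assumption: this mirrors the Emacs regex closely enough for experimentation.
const graphvizOpenTag = /<pre +type="text\/graphviz"[^>]*>/;

// Returns true when the given opening tag would start a graphviz-dot-mode region.
function startsGraphvizRegion(tag) {
  return graphvizOpenTag.test(tag);
}

// Hypothetical sample tags.
const sampleTags = [
  '<pre type="text/graphviz" id="graphviz_source">',
  '<pre   type="text/graphviz">',
  '<pre id="graphviz_source" type="text/graphviz">',
  '<pre>',
];
for (const tag of sampleTags) {
  console.log(startsGraphvizRegion(tag), tag);
}
```

Note that the pattern expects type to be the first attribute after <pre, so a tag that puts id before type (the third sample) does not start a Graphviz region.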

The following two screenshots show the difference when you put your
point on the Graphviz source code and on the HTML source code:

[Image: multi-web-mode screenshot]

Rendering

Given an HTML page with Graphviz source code embedded in a <pre>
tag, I wrote an Ajax program that sends the source code to a server
(graphviz-server). The server returns an <img> tag pointing to the
resulting image file, and the Ajax program then inserts this tag at an
author-specified place in the HTML page.

The following demo HTML code snippet shows how we specify where to
place the <img> tag, and how to invoke graphviz-server:

<head>
  <script type="text/javascript" src="./graphviz-client.js"></script>
</head>
<body onload="mcDrawGraphviz('graphviz_source', '/graphviz/', 'dot_img')">
  <div id="dot_img">
    <pre type="text/graphviz" id="graphviz_source">
        digraph finite_state_machine {
        ... (more graphviz source code) ...

The <head> tag includes the Ajax program graphviz-client.js, which,
as written in the <body> tag, is invoked when the page is
loaded. The mcDrawGraphviz function grabs the source code from the inner
text of the element with id "graphviz_source", which is the <pre>
tag below, makes an HTTP POST request to the server listening on
"/graphviz/", and then puts the returned <img> tag inside the <div>
tag with element id "dot_img".
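
The flow described above can be sketched in JavaScript as follows. This is not the code in graphviz-client.js; it is a minimal sketch that assumes the server replies with the <img> markup as plain text. The names drawGraphvizSketch and xhrPost, and the injected doc and post parameters, are hypothetical; they are introduced so the logic can run in a browser as well as in a test without a DOM.

```javascript
// Hypothetical sketch of the client flow; not the actual graphviz-client.js.
// doc  : an object with getElementById (in a browser, pass `document`).
// post : a function (url, body, onReply) that performs the HTTP POST and
//        calls onReply with the response text.
function drawGraphvizSketch(doc, post, sourceId, serverUrl, targetId) {
  // 1. Grab the Graphviz source from the text of the source element.
  const source = doc.getElementById(sourceId).textContent;
  // 2. POST it to the server.
  post(serverUrl, source, function (imgTag) {
    // 3. Put the returned <img> markup inside the target element.
    doc.getElementById(targetId).innerHTML = imgTag;
  });
}

// A browser-side `post` built on XMLHttpRequest (assumes a plain-text reply).
function xhrPost(url, body, onReply) {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', url);
  xhr.onload = function () { onReply(xhr.responseText); };
  xhr.send(body);
}
```

In a page like the demo above, this sketch could be called as drawGraphvizSketch(document, xhrPost, 'graphviz_source', '/graphviz/', 'dot_img').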

Serving

To respond to the Ajax client, we need a server program that invokes
the Graphviz suite and renders the POSTed Graphviz source code. I
guess such servers have already been written in Python, Ruby, and other
rapid-development languages. Still, I wrote one in the Racket
language (a dialect of Scheme) for practice.

Racket has a Web-server programming framework, which reminds me of
Rails for Ruby, but I do not use it. My code is a tweak of
this
example program. It is
a multi-threaded server, and each worker thread creates a subprocess
to invoke Graphviz for rendering if it cannot find a previously
rendered image.
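
The render-or-reuse behavior of a worker can be sketched as below. This is a JavaScript (Node.js) illustration, not the author's Racket code. In particular, naming the image after a SHA-1 hash of the source is an assumption made here so that identical sources map to the same file; this post does not say how graphviz-server looks up previously rendered images. The names imageNameFor, runDot, and renderIfMissing are hypothetical.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');
const { spawnSync } = require('child_process');

// Assumption: derive the image file name from a hash of the Graphviz source.
function imageNameFor(source) {
  return crypto.createHash('sha1').update(source, 'utf8').digest('hex') + '.png';
}

// Runs the Graphviz `dot` command in a subprocess, feeding the source on stdin.
function runDot(source, outFile) {
  const result = spawnSync('dot', ['-Tpng', '-o', outFile], { input: source });
  if (result.error || result.status !== 0) {
    throw new Error('dot failed for ' + outFile);
  }
}

// Returns the image path, invoking `render` only when no cached file exists.
function renderIfMissing(source, imageDir, render = runDot) {
  const outFile = path.join(imageDir, imageNameFor(source));
  if (!fs.existsSync(outFile)) {
    render(source, outFile);
  }
  return outFile;
}
```

The render parameter defaults to runDot, which needs Graphviz installed; passing a stand-in function makes it possible to exercise the caching logic without it.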

Note that the Ajax client cannot access graphviz-server directly via
its IP and port. Due to the same-origin security policy of Web browsers,
the Ajax client JavaScript program, which was downloaded from the
document server together with the HTML page, can access only URIs
pointing to the same server (the document server), not
graphviz-server.

More precisely, when the JavaScript program uses an XMLHttpRequest
object to send an HTTP request to a server other than the one from
which the program was downloaded, the browser in many cases first sends
an OPTIONS (preflight) request before the real request. If the server
does not respond with a security policy that permits the request, the
XMLHttpRequest object won’t send the real request. This is complex,
so we want to avoid it.
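
Whether a request falls under this cross-origin policy depends on the origin of its URL, that is, the scheme, host, and port. The following JavaScript sketch compares origins using the standard URL class. The function name isSameOrigin is hypothetical, and the URLs are only examples.

```javascript
// Returns true when requestUrl has the same origin (scheme, host, port)
// as pageUrl. Relative request URLs are resolved against the page URL.
function isSameOrigin(pageUrl, requestUrl) {
  const page = new URL(pageUrl);
  const target = new URL(requestUrl, page);
  return page.origin === target.origin;
}

const demoPage = 'http://graphviz.server/graphviz-demo.html';
// A path on the document server itself: same origin.
console.log(isSameOrigin(demoPage, '/graphviz/'));
// The same host on another port: a different origin.
console.log(isSameOrigin(demoPage, 'http://graphviz.server:9981/'));
```

A request to a relative path such as /graphviz/ stays on the page's own origin, which is why routing everything through one server avoids the cross-origin handshake.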

One solution is to set up the document server using Nginx and make
graphviz-server an upstream server of Nginx. When the Ajax client
accesses a certain URL of Nginx, Nginx proxies the request to
graphviz-server. From the perspective of Ajax clients, there is only
one server: the Nginx server. The following Nginx server
configuration file shows how to do this:

server {
  listen       80;
  server_name  graphviz.server;
  root /Users/wyi/Projects/graphviz-server;

  autoindex on;

  location / {
    index index.html index.php;
    try_files $uri $uri/ =404;
  }

  location /graphviz/ {
    proxy_pass http://graphviz.server:9981;
  }
}

This defines an Nginx virtual server named graphviz.server, which
listens on port 80. When you access http://graphviz.server/, the
server returns the content of the local directory
/Users/wyi/Projects/graphviz-server. (You might want a better name
such as ~/blog.) If you access http://graphviz.server/graphviz/,
your request is proxied to graphviz-server, which runs on the same
computer and listens on port 9981.

Note that when you access graphviz-server from a Web browser, a GET
request is sent, and the server returns an HTML page with usage
information. The Ajax client program sends POST requests, in which
case the server does the rendering work and returns an <img> tag.

To make the above configuration work, put it into a file, say,
document_server.nginx.conf, and add an include directive to your
Nginx configuration file:

include "the/path/to/document_server.nginx.conf";

Also, you need to modify your DNS settings to assign graphviz.server
an IP address. For development and testing, just add the following
line into your /etc/hosts file:

127.0.0.1    graphviz.server

Setting Up

The following steps help you build graphviz-server:

  1. Download and install Racket on your
    development computer, which could be the same computer as your
    document server. You need it to compile graphviz-server into
    a native binary.
  2. Check out the graphviz-server
    source code to your
    development computer:

    git clone https://github.com/wangkuiyi/graphviz-server.git

  3. Build graphviz-server:

    raco exe graphviz-server.rkt && raco distribute build graphviz-server

    The binary (graphviz-server) and related libraries will be placed
    in subdirectory build. You should copy build to somewhere on your
    document server.

The following steps help you set up your document server:

  1. On the document server, create a document directory. Mine is
    /Users/wyi/Projects/graphviz-server. You might want a better
    name such as ~/blog.
  2. Move graphviz-client.js and graphviz-demo.html, which you checked
    out from the above GitHub repository, into ~/blog/. You won’t edit
    graphviz-client.js, but you might want to write your pages
    similar to graphviz-demo.html.
  3. On the document server, install Nginx. On Debian-like Linux
    distributions, you can use sudo apt-get install nginx. On Mac OS
    X, you can use Homebrew: brew install nginx.
  4. Move the nginx.conf you checked out to ~/blog/blog.nginx.conf,
    and add a line to your Nginx configuration file to include
    ~/blog/blog.nginx.conf. You might want to change the
    server_name directive in blog.nginx.conf to use the domain name
    of your document server, or you might want to edit your hosts file
    to assign your localhost a better name if you use it as the
    document server.
  5. Start Nginx using sudo nginx or, if it is already running, reload
    its configuration using sudo nginx -s reload.
  6. Start graphviz-server:

    /path/to/build/bin/graphviz-server -d ~/blog -u /blog -p 9981

    The -d parameter specifies a local directory holding the
    generated PNG images. The -u parameter specifies a URL prefix
    used when returning the PNG image URL. For example, a generated PNG
    image ~/blog/xxyyzz.png will be returned as
    http://graphviz.server/blog/xxyyzz.png. The -p parameter
    specifies the port on which the server listens. Please make sure
    the port is the same as specified in ~/blog/blog.nginx.conf.
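
The relationship between -d, -u, and the returned image URL can be illustrated with a short JavaScript sketch. The function imageUrlFor is hypothetical, and the sketch only mirrors the example above (an image file in the -d directory is returned under the -u prefix). How graphviz-server actually builds the URL is not shown in this post.

```javascript
// Hypothetical helper: maps a generated image file to the URL returned to the client.
// host      : the document server name, e.g. "graphviz.server"
// urlPrefix : the value passed to -u, e.g. "/blog"
// imageFile : path of the generated PNG inside the -d directory
function imageUrlFor(host, urlPrefix, imageFile) {
  const fileName = imageFile.split('/').pop();   // keep only the file name
  const prefix = urlPrefix.replace(/\/+$/, '');  // drop any trailing slashes
  return 'http://' + host + prefix + '/' + fileName;
}

console.log(imageUrlFor('graphviz.server', '/blog', '~/blog/xxyyzz.png'));
// prints http://graphviz.server/blog/xxyyzz.png
```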

Testing

Now, it is time to check your setup. Open a browser, and try entering
the following URLs:

  1. http://graphviz.server/
    You should see your documents in ~/blog/.
  2. http://graphviz.server/graphviz/
    You should see the hello page of graphviz-server.
  3. http://graphviz.server/graphviz-demo.html
    You should see an HTML page with a Graphviz-generated PNG image embedded.

Industry-Strength Machine Learning Systems

August 15, 2012

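I read through the talk that Professor Chih-Jen Lin gave yesterday at KDD 2012, "Experiences and Lessons in Developing Industry-Strength Machine Learning and Data Mining Software", and it resonated deeply with me. The two points that impressed me most are:

  1. Software quality must be high. Professor Lin gave an example about numerical precision. Earlier, when he was a visiting researcher at Google China, he gave more such examples.
  2. Functionality should not be overly complicated. Understand what users need and help them make choices; do not sacrifice system stability for a small gain in accuracy.

Professor Lin comes from a numerical computation background, so his standards for code quality are naturally high. By continually analyzing and responding to user needs, he has made libsvm and liblinear very stable and powerful. At the same time, he keeps researching new methods and works hard to promote their adoption in industry, for which he frequently does visiting research at companies. These working methods are well worth learning for those of us who do algorithm research in companies.

In my own work, I have also run into problems that academia pays little attention to, similar to those Professor Lin pointed out. One example: in click-through rate prediction for advertising, a large share of training and testing time is spent on feature construction rather than on the numerical algorithm. Therefore, when running in parallel (or concurrently), this part of the computation should be distributed as much as possible.

Another example: many studies add random variables to models such as LDA in order to obtain higher-quality latent topics, yet many such models do not readily admit efficient parallel training algorithms. On the other hand, there are many improvements that do not affect the parallelization of training and that are more "fundamental". We have observed these situations in our work and should report them back to academia.

I think we need to make efforts in two directions:

  1. Value the accumulation of fundamental theoretical knowledge, so that when we face practical engineering problems we can find solutions that are as "complete" and "elegant" as possible.
  2. Hold the systems we develop to industrial standards. Actively summarize industry experience and feed it back to academia.

Those who work on algorithms in companies should learn from Professor Lin and become bridges between academia and industry.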

