Extract Text from PDF Files

March 23, 2014

I got this solution from Stack Overflow.

A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix and Mac OS X). The utility comes from either Poppler or Xpdf. This is a command you could try:

 pdftotext \
   -f 13 \
   -l 17 \
   -layout \
   -opw supersecret \
   -upw secret \
   -eol unix \
   -nopgbrk \
   /path/to/your/pdf \
   - | less

This extracts pages 13 (first page) through 17 (last page), preserves the layout, opens a PDF file protected by two passwords (user password secret and owner password supersecret), uses the Unix EOL convention, inserts no page breaks between PDF pages, and pipes the result through less.
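If the same extraction has to be scripted, the argument list can be assembled in a program. The following is only a sketch in Go, assuming pdftotext is on the PATH; the helper name pdftotextArgs and the way the PDF path is passed on the command line are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// pdftotextArgs builds the argument list for the command shown above.
// The helper and its parameters are hypothetical, for this sketch only.
func pdftotextArgs(first, last int, ownerPW, userPW, pdfPath string) []string {
	return []string{
		"-f", strconv.Itoa(first),
		"-l", strconv.Itoa(last),
		"-layout",
		"-opw", ownerPW,
		"-upw", userPW,
		"-eol", "unix",
		"-nopgbrk",
		pdfPath,
		"-", // write the extracted text to stdout
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract /path/to/your/pdf")
		return
	}
	args := pdftotextArgs(13, 17, "supersecret", "secret", os.Args[1])
	out, err := exec.Command("pdftotext", args...).Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "pdftotext failed:", err)
		os.Exit(1)
	}
	os.Stdout.Write(out)
}
```

Passing each flag as its own argument avoids any shell quoting problems with passwords or paths that contain spaces.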


Removing control-Ms (^M) in Text File using Sed

March 23, 2014

When I extract text from a database or a PDF file (using Xpdf’s pdftotext), I get fields or words suffixed with the special character ^M (a carriage return). Note that these ^M’s do not appear only at the ends of lines. I use the following command line to remove these annoying ^M’s:

sed 's/^M//g' my-text-file

where the ^M in the above command line is typed by pressing Ctrl-V and then Ctrl-M. The g flag matters because a line may contain more than one ^M.
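The same cleanup can be done without typing a literal control character. Here is a minimal sketch in Go that filters standard input; the function name stripCR is my own, and it assumes that every carriage return in the file is unwanted.

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

// stripCR removes every carriage return (the character shown as ^M).
func stripCR(s string) string {
	return strings.ReplaceAll(s, "\r", "")
}

func main() {
	// Filter stdin to stdout, e.g. `go run stripcr.go < my-text-file`.
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	for {
		line, err := in.ReadString('\n')
		out.WriteString(stripCR(line))
		if err != nil {
			return // io.EOF or a read error ends the loop
		}
	}
}
```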


Install YARN on Mac OS X

March 22, 2014

I ran into a whole bunch of problems when I tried to install Hadoop 2.2.0 (YARN) on my Mac OS X system. Thanks to Alex JF for the tutorial, which works on both Linux and Mac OS X.


Programming Qt 5 Using Go

January 13, 2014

Salviati’s go-qt5 project on GitHub makes it possible to write GUI programs using Qt 5 and Go.

On Mac OS X Mavericks, I tried to build and run some Qt 5.2 programs written in Go. Here is what I did:

  1. Install Qt 5.2 using Homebrew.
    brew update && brew doctor && brew install qt5
    Homebrew warns you that Qt 4 is more widely used than Qt 5. That is fine here, because go-qt5 is a Go binding of Qt 5.
  2. Check out go-qt5.
    mkdir -p /home/you/go-qt5
    export GOPATH=/home/you/go-qt5
    go get github.com/salviati/go-qt5
  3. Build go-qt5
    You can follow the README file on https://github.com/salviati/go-qt5 to build go-qt5 and some example programs. Since Homebrew does not create symbolic links for Qt 5, you need to invoke qmake as /usr/local/Cellar/qt5/5.2.0/bin/qmake. Also, you need to copy $GOPATH/src/github.com/salviati/go-qt5/lib/libgoqt5drv.1.0.0.dylib into the same directory as your Go binaries before you can run them.

RPC in Go: The Client Side

October 27, 2013

Conceptually, each client maintains two things: a connection to the server, over which encoded requests and responses are transmitted, and a pending list, which holds requests that have been sent and are still waiting for responses.

In order to match responses with pending requests, every call is assigned a monotonically increasing sequential number. The pending list is in fact a mapping from the sequential number to an rpc.Call struct.

Every client has a goroutine that collects responses and matches them with pending requests. When a response matches, the goroutine removes the matched request from the pending list and notifies the caller that the call has completed.

Modifications of the pending list are protected by the mutex rpc.Client.mutex, which avoids race conditions caused by sending requests and receiving responses in parallel.

The increment of the sequential number is protected by another mutex, rpc.Client.sending, so concurrent senders cannot get confused about the sequential number.
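The bookkeeping described here can be modeled in a few lines. The sketch below is not the net/rpc source: the types call and pendingList are made up, and for brevity it guards both the sequential number and the map with a single lock.

```go
package main

import (
	"fmt"
	"sync"
)

// call stands in for rpc.Call; only one field is kept for the sketch.
type call struct {
	ServiceMethod string
}

// pendingList is a simplified, hypothetical model of the client state.
type pendingList struct {
	mu      sync.Mutex
	seq     uint64
	pending map[uint64]*call
}

func newPendingList() *pendingList {
	return &pendingList{pending: make(map[uint64]*call)}
}

// add assigns the next sequential number and records the call as pending.
func (p *pendingList) add(c *call) uint64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	seq := p.seq
	p.seq++
	p.pending[seq] = c
	return seq
}

// take removes and returns the call waiting for seq, or nil if there is none.
func (p *pendingList) take(seq uint64) *call {
	p.mu.Lock()
	defer p.mu.Unlock()
	c := p.pending[seq]
	delete(p.pending, seq)
	return c
}

func main() {
	p := newPendingList()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // four concurrent senders
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.add(&call{ServiceMethod: "AService.AMethod"})
		}()
	}
	wg.Wait()
	// A response can be matched once; a second match finds nothing.
	fmt.Println(p.take(2) != nil, p.take(2) != nil) // true false
}
```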

Establish the Connection

We can establish a TCP connection between the client and the server by calling rpc.Dial, which invokes net.Dial to establish a connection and calls rpc.NewClient to wrap up this connection as an RPC client, an rpc.Client struct.

We can also establish a TCP connection by using the CONNECT method of the HTTP protocol. This is done by calling rpc.DialHTTP, which invokes rpc.DialHTTPPath with DefaultRPCPath = "/_goRPC_". rpc.DialHTTPPath then invokes net.Dial to establish a connection and sends an HTTP request whose method is CONNECT and whose URI is DefaultRPCPath. The RPC server, which must be an HTTP server, should understand that this request asks for a TCP connection for later RPC calls, so it should keep that connection alive. Once the connection is established, a call to rpc.NewClient wraps the connection in a client.
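To see both ways of connecting side by side, here is a small sketch that serves one service over plain TCP and over HTTP on loopback ports, then calls it through rpc.Dial and rpc.DialHTTP. The Echo service and the helpers serve, say, and demo are made up for illustration; the listeners are never closed, which is acceptable only in a demo.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/rpc"
)

// Echo is a hypothetical service used only for this sketch.
type Echo struct{}

func (Echo) Say(msg string, reply *string) error {
	*reply = "echo: " + msg
	return nil
}

// serve starts a raw TCP listener and an HTTP listener on loopback ports
// chosen by the OS and returns their addresses.
func serve() (tcpAddr, httpAddr string, err error) {
	server := rpc.NewServer()
	if err = server.Register(Echo{}); err != nil {
		return "", "", err
	}
	tcpLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	go server.Accept(tcpLn) // plain TCP, for rpc.Dial

	httpLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	mux := http.NewServeMux()
	mux.Handle(rpc.DefaultRPCPath, server) // answers the CONNECT request
	go http.Serve(httpLn, mux)             // for rpc.DialHTTP
	return tcpLn.Addr().String(), httpLn.Addr().String(), nil
}

// say makes one synchronous call and closes the client.
func say(client *rpc.Client, msg string) (string, error) {
	defer client.Close()
	var reply string
	err := client.Call("Echo.Say", msg, &reply)
	return reply, err
}

// demo calls the same service once over each kind of connection.
func demo() (viaTCP, viaHTTP string, err error) {
	tcpAddr, httpAddr, err := serve()
	if err != nil {
		return "", "", err
	}
	c1, err := rpc.Dial("tcp", tcpAddr)
	if err != nil {
		return "", "", err
	}
	if viaTCP, err = say(c1, "tcp"); err != nil {
		return "", "", err
	}
	c2, err := rpc.DialHTTP("tcp", httpAddr)
	if err != nil {
		return "", "", err
	}
	viaHTTP, err = say(c2, "http")
	return viaTCP, viaHTTP, err
}

func main() {
	a, b, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(a)
	fmt.Println(b)
}
```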

Encoding and Decoding

The rpc.Client struct contains a member rpc.Client.codec of type rpc.ClientCodec, which wraps the network connection. It encodes all requests and decodes all responses.

A client created by rpc.NewClient has an rpc.gobClientCodec codec, an implementation of the rpc.ClientCodec interface. It is also possible to specify another implementation by creating the client with rpc.NewClientWithCodec. The rpc.Dial* functions call rpc.NewClient, whereas jsonrpc.NewClient calls rpc.NewClientWithCodec.

Indeed, rpc.NewClient invokes rpc.NewClientWithCodec, and the latter, before returning the client object, starts a goroutine with go client.input(), which receives the responses to pending requests and notifies callers of the completion of their calls.

  • It seems that there is no CreateServerWithCodec on the server side. So how should I write a JSON-RPC server?
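One possible answer to this question: on the server side the codec is chosen per connection rather than per server, through rpc.ServeCodec together with jsonrpc.NewServerCodec. The sketch below tries this over an in-memory pipe instead of a real socket; the Arith service and the helper multiplyOverJSONRPC are made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// Args and Arith are hypothetical, for illustration only.
type Args struct{ A, B int }

type Arith struct{}

func (Arith) Multiply(args *Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

// multiplyOverJSONRPC serves Arith with the JSON codec on one end of an
// in-memory pipe and calls it with a JSON-RPC client on the other end.
func multiplyOverJSONRPC(a, b int) (int, error) {
	server := rpc.NewServer()
	if err := server.Register(Arith{}); err != nil {
		return 0, err
	}
	srvConn, cliConn := net.Pipe()
	// The server picks its codec per connection.
	go server.ServeCodec(jsonrpc.NewServerCodec(srvConn))

	client := jsonrpc.NewClient(cliConn)
	defer client.Close()

	var product int
	err := client.Call("Arith.Multiply", &Args{a, b}, &product)
	return product, err
}

func main() {
	product, err := multiplyOverJSONRPC(6, 7)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(product) // 42
}
```

With a real listener, the same call would sit in the accept loop: one ServeCodec (or the shortcut jsonrpc.ServeConn) per accepted connection.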

Make Calls

Conceptually, every RPC call consists of the name of the service and method, an argument, and the reply. In addition, an error might occur during the call, and a done channel is used to signal the completion of the call. All of these are described by the rpc.Call struct.

To make a call, we can call rpc.Client.Go, which requires the service/method name, the argument, the holder of the reply, and the done channel. rpc.Client.Go encapsulates all these inputs into an rpc.Call struct and passes it to rpc.Client.send.

The rpc.Call struct has a method done, which, when invoked, signals the completion of the call by sending the rpc.Call struct itself to the done channel, which is of type chan *Call. As the done channel is provided by the caller, the caller learns the reply or the error once it reads an rpc.Call struct from the channel.
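A small end-to-end sketch of one asynchronous call may help. It runs the server and the client in one process over an in-memory pipe; the Arith service and the helper asyncAdd are made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// Args and Arith are hypothetical, for illustration only.
type Args struct{ A, B int }

type Arith struct{}

func (Arith) Add(args *Args, reply *int) error {
	*reply = args.A + args.B
	return nil
}

// asyncAdd issues one call with Client.Go and waits on the done channel.
func asyncAdd(a, b int) (int, error) {
	server := rpc.NewServer()
	if err := server.Register(Arith{}); err != nil {
		return 0, err
	}
	srvConn, cliConn := net.Pipe()
	go server.ServeConn(srvConn) // gob codec, one connection

	client := rpc.NewClient(cliConn)
	defer client.Close()

	sum := new(int)
	// The done channel must be buffered; Go panics on an unbuffered one.
	done := make(chan *rpc.Call, 1)
	client.Go("Arith.Add", &Args{a, b}, sum, done)

	call := <-done // the same *rpc.Call that Go returned
	if call.Error != nil {
		return 0, call.Error
	}
	return *call.Reply.(*int), nil
}

func main() {
	sum, err := asyncAdd(2, 3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(sum) // 5
}
```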

Calling Patterns

The caller can pass the same done channel to multiple RPC calls and wait to read all responses from that channel. This can be very useful in some cases.

Suppose that we have a bunch of instances of a service, and it is OK if we get a response from any one of them. We can make calls like:

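var clients []*rpc.Client                  // connections to the N=10 service instances.
done := make(chan *rpc.Call, len(clients)) // one slot per call, so no completion blocks or gets dropped.
for _, c := range clients {                // make calls to all these instances.
  // "AService.AMethod" and AReply are placeholders; every call needs its own reply holder.
  c.Go("AService.AMethod", arg, new(AReply), done)
}
call := <-done // blocks until any instance replies.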

Another common case is to collect information from a bunch of servers in order to build a request for the next stage of RPC invocation. This can be done by replacing the last line of the above code with:

var req ARequest
for i := 0; i < len(clients); i++ { // expect one completion per call made above.
  call := <-done
  if call.Error != nil {
    log.Fatal(call.Error)
  }
  req.arg[call.ServiceMethod] = call.Reply
}
// Make another RPC call with req as the argument.
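The two fragments above leave out the servers and the types. For a version that actually runs, here is a sketch that starts three in-process servers, fans one request out to all of them, and gathers every reply from a shared done channel. The Shard service and the helpers connect and gather are made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// Shard is a hypothetical service; each instance scales by its own factor.
type Shard struct{ Factor int }

func (s *Shard) Scale(n int, reply *int) error {
	*reply = n * s.Factor
	return nil
}

// connect starts one in-process server and returns a client connected to it.
func connect(factor int) (*rpc.Client, error) {
	server := rpc.NewServer()
	if err := server.Register(&Shard{Factor: factor}); err != nil {
		return nil, err
	}
	srvConn, cliConn := net.Pipe()
	go server.ServeConn(srvConn)
	return rpc.NewClient(cliConn), nil
}

// gather calls every instance and sums the replies as they complete.
func gather(n int, factors []int) (int, error) {
	var clients []*rpc.Client
	for _, f := range factors {
		c, err := connect(f)
		if err != nil {
			return 0, err
		}
		defer c.Close()
		clients = append(clients, c)
	}

	done := make(chan *rpc.Call, len(clients)) // one slot per call
	for _, c := range clients {
		c.Go("Shard.Scale", n, new(int), done) // separate reply holder per call
	}

	total := 0
	for range clients { // read exactly one completion per call
		call := <-done
		if call.Error != nil {
			return 0, call.Error
		}
		total += *call.Reply.(*int)
	}
	return total, nil
}

func main() {
	total, err := gather(2, []int{1, 2, 3})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(total) // 2*1 + 2*2 + 2*3 = 12
}
```

Counting completions instead of ranging over the channel matters: nobody closes the done channel, so a range loop over it would never end.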

Sending Request

The method rpc.Client.send is a critical section protected by the mutex rpc.Client.sending. The method checks whether the client is closing or has been shut down. If so, it calls call.done to finish the call instead of transmitting it over the network connection.

Otherwise, it assigns a new sequential number to the call. This sequential number is used as the key when it adds the call to the pending list. Then the sequential number, together with the service/method name and the argument, is written to the server by calling client.codec.WriteRequest.

Receiving Response

The goroutine created by rpc.NewClientWithCodec (which is invoked by rpc.NewClient) runs rpc.Client.input and is in charge of collecting responses and matching them with pending requests.

The response consists of a header and a body. If an error occurs while decoding the header, rpc.Client.input terminates itself. Otherwise, the call that matches the sequential number in the header is removed from the pending list.

If the response has no matching pending request, something might have gone wrong with the call to rpc.ClientCodec.WriteRequest, and the body should be read and discarded. Or, if the response header reports an error on the server, the body should also be read and discarded. Only when everything is all right is the body read and decoded into call.Reply.

Finally, rpc.Client.input invokes call.done on the matched call, which by then carries either the error from the server or the reply.

Client Codec

The above procedure also explains why the rpc.ClientCodec interface contains the following methods:

  • WriteRequest(r *Request, body interface{}) error
  • ReadResponseHeader(r *Response) error
  • ReadResponseBody(body interface{}) error
  • Close() error

Method WriteRequest encodes and writes rpc.Request, which contains the sequential number and service/method name, and the argument (as body).

Method ReadResponseHeader reads and decodes rpc.Response, which contains the sequential number, service/method name and a possible error.

Method ReadResponseBody reads and decodes the reply.
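Because rpc.ClientCodec is an interface, a codec can also be wrapped. As an illustration, the sketch below embeds the JSON client codec in a made-up countingCodec that counts requests and responses, and plugs it in with rpc.NewClientWithCodec. The Echo service and the helper countedCall are hypothetical as well.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
	"sync/atomic"
)

// Echo is a hypothetical service used only for this sketch.
type Echo struct{}

func (Echo) Say(msg string, reply *string) error {
	*reply = "echo: " + msg
	return nil
}

// countingCodec wraps another rpc.ClientCodec and counts traffic.
// ReadResponseBody and Close are inherited from the embedded codec.
type countingCodec struct {
	rpc.ClientCodec
	requests, responses int64
}

func (c *countingCodec) WriteRequest(r *rpc.Request, body interface{}) error {
	atomic.AddInt64(&c.requests, 1)
	return c.ClientCodec.WriteRequest(r, body)
}

func (c *countingCodec) ReadResponseHeader(r *rpc.Response) error {
	err := c.ClientCodec.ReadResponseHeader(r)
	if err == nil {
		atomic.AddInt64(&c.responses, 1)
	}
	return err
}

// countedCall makes one call through the wrapper over an in-memory pipe.
func countedCall(msg string) (reply string, requests, responses int64, err error) {
	server := rpc.NewServer()
	if err = server.Register(Echo{}); err != nil {
		return
	}
	srvConn, cliConn := net.Pipe()
	go server.ServeCodec(jsonrpc.NewServerCodec(srvConn))

	codec := &countingCodec{ClientCodec: jsonrpc.NewClientCodec(cliConn)}
	client := rpc.NewClientWithCodec(codec)
	defer client.Close()

	err = client.Call("Echo.Say", msg, &reply)
	return reply, atomic.LoadInt64(&codec.requests), atomic.LoadInt64(&codec.responses), err
}

func main() {
	reply, requests, responses, err := countedCall("hi")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply, requests, responses)
}
```

The counters are updated atomically because WriteRequest runs in the caller's goroutine while ReadResponseHeader runs in the client's input goroutine.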


Vagrant Is a Blessing for Developers of Parallel Systems

October 17, 2013

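Vagrant is a tool for conveniently configuring and accessing virtual machines. We will probably use it often to set up uniform development and testing environments.

As long as VirtualBox and Vagrant are installed on the machine, the following steps set up a basic development environment: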

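  1. vagrant box add p32 http://files.vagrantup.com/precise32.box downloads a standard VirtualBox virtual machine image and stores it under ~/.vagrant.d/boxes/p32 for later use. precise32.box is a 32-bit Ubuntu machine.

  2. mkdir learn; cd learn; vagrant init p32 creates a directory learn and adds a configuration file, Vagrantfile, to it. This file specifies the p32 image as the base image of the virtual machine. The Vagrantfile is usually checked into the version control system together with the other source files in the learn directory.

  3. Edit the Vagrantfile to specify a provisioning script, bootstrap.sh: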

    Vagrant.configure("2") do |config|
      config.vm.box = "p32"
      config.vm.provision :shell, :path => "bootstrap.sh"
    end
    

    Then, in the current directory learn, write the provisioning script bootstrap.sh, which installs Apache:

    #!/usr/bin/env bash
    apt-get update
    apt-get install -y apache2
    rm -rf /var/www
    ln -fs /vagrant /var/www
    
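  4. vagrant up boots a virtual machine from the p32 image; once it is up, it runs bootstrap.sh, which installs Apache.

  5. vagrant ssh logs in to the running virtual machine over ssh. Inside the virtual machine, the directory /vagrant mirrors the learn directory on the host (the directory that contains the Vagrantfile), so all source code in learn is easily accessible from the virtual machine.

  6. vagrant suspend suspends the machine; vagrant resume wakes it up; vagrant halt shuts it down; vagrant up boots it and runs the provisioning script again; vagrant reload --provision restarts an already running virtual machine and runs the provisioning script.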

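By editing the Vagrantfile, we can make vagrant up boot several virtual machines, each with its own configuration. We can then deploy a parallel system (for example, an online advertising system) on these virtual machines and run it. The cluster operating system Skynet uses this approach for its automated testing.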


Cluster Name Service

October 17, 2013

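The backend of a complex Internet service usually consists of many service programs that call one another, forming a directed acyclic graph. The callee is usually called the upstream service and the caller the downstream service.

Usually, for fault tolerance or higher throughput, each service program is started as multiple processes, each of which is called an instance.

When an instance of a downstream service wants to call an upstream service, it should in principle only need to know the name of the upstream service (the service name), since connecting to any instance would do. To establish an IP connection, however, it needs the IP address and port of one specific instance of the upstream service (the instance address, for short).

An intuitive solution is a global name service that maintains a mapping from each service name to all of its instance addresses. This name service must be very robust, because if it goes down, no service in the cluster can find its upstream services any more. Fortunately, we can rely on the Paxos protocol to ensure this robustness.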

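A typical system that implements Paxos is Google's Chubby. Among open-source software, ZooKeeper (written in Java; strictly speaking it uses Zab, a Paxos-like protocol) and Doozer (written in Go) are solid implementations as well.

But Chubby, ZooKeeper, and Doozer are not really name services: although they can reliably maintain arbitrary mappings, none of them supports registering service names and instances, and none can tell whether a registered instance is still alive.

Google has a name service called GNS (Google Name Service), which uses Chubby to maintain the mapping from service names to instances and also supports a heartbeat mechanism: when an instance starts, it registers its address with GNS, and it must declare "I am still alive" at regular intervals; otherwise GNS removes its address, so that downstream services never get a stale instance address.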
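The heartbeat idea can be modeled in a few lines. The sketch below is a toy, in-memory registry written for illustration only: it is not GNS or any real name service, the names registry, heartbeat, lookup, and demo are made up, and a real service would replicate the map with a consensus protocol.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// registry maps service name -> instance address -> time of last heartbeat.
type registry struct {
	mu        sync.Mutex
	ttl       time.Duration
	instances map[string]map[string]time.Time
}

func newRegistry(ttl time.Duration) *registry {
	return &registry{ttl: ttl, instances: make(map[string]map[string]time.Time)}
}

// heartbeat registers addr under service, or refreshes it if already known.
func (r *registry) heartbeat(service, addr string, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.instances[service] == nil {
		r.instances[service] = make(map[string]time.Time)
	}
	r.instances[service][addr] = now
}

// lookup returns the live addresses of service in sorted order and
// drops every address whose last heartbeat is older than the TTL.
func (r *registry) lookup(service string, now time.Time) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var live []string
	for addr, seen := range r.instances[service] {
		if now.Sub(seen) > r.ttl {
			delete(r.instances[service], addr) // missed its heartbeats
			continue
		}
		live = append(live, addr)
	}
	sort.Strings(live)
	return live
}

// demo registers two instances, lets only the first keep beating,
// and looks the service up after the second one has expired.
func demo() []string {
	t0 := time.Now()
	r := newRegistry(10 * time.Second)
	r.heartbeat("search", "10.0.0.1:8000", t0)
	r.heartbeat("search", "10.0.0.2:8000", t0)
	r.heartbeat("search", "10.0.0.1:8000", t0.Add(8*time.Second))
	return r.lookup("search", t0.Add(12*time.Second))
}

func main() {
	fmt.Println(demo()) // [10.0.0.1:8000]
}
```

Passing the current time in as a parameter keeps the expiry logic testable without sleeping.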

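The open-source world long lacked such a name service, and many cluster administrators had no choice but to build their own on top of ZooKeeper. Recently someone finally wrote one in Go: SkyDNS. It does not depend on ZooKeeper; instead, it takes care of consensus itself (using Raft, which plays the same role as Paxos) to ensure the robustness of the service.

After it starts, SkyDNS opens two ports: an HTTP port and a DNS port. Any service instance can register itself through the HTTP port, and any program can look up the instance addresses of a service by its service name, just as it would query a standard DNS server.

SkyDNS is a blessing for companies other than Google: it lets us build powerful, easy-to-use parallel computing clusters in real-world work.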

