Convert Chinese LaTeX Source to HTML and PDF

June 27, 2014

I am going to write a book in Chinese. I hope to publish its chapters on my blog, so that I can get feedback before it is printed. This requires converting my manuscript into HTML (for publishing on my blog) and PDF (for printing).

I tried writing in Emacs Org mode, Wiki markup and Markdown. However, none of them supports equations well. So I decided to use LaTeX.

I tried several tools to convert LaTeX source into HTML, including htlatex and pandoc. htlatex does not support Chinese well, and pandoc supports only a small part of LaTeX syntax. Finally, I decided to use Hevea, which works well for me.

I use XeLaTeX to convert LaTeX to PDF. Compared with PDFLaTeX, XeLaTeX works better with UTF-8 and TrueType Chinese fonts.

However, Hevea and XeLaTeX have different requirements for the preamble of the LaTeX source. So I created a separate template for each of them. Both templates use LaTeX’s \input directive to include a LaTeX source file containing the real text.

An example project is at https://github.com/wangkuiyi/hevea-xelatex.


Configure an HDFS for Development/Testing

May 10, 2014

I am using a Go implementation of the WebHDFS interface: https://github.com/vladimirvivien/gowfs. In order to test it, I need to set up an HDFS on my development computer (Mac OS X 10.8, Hadoop-2.2.0). The author, Vladimir Vivien, points out two properties that must be set to enable WebHDFS:

  1. Enable dfs.webhdfs.enabled property in hdfs-site.xml
  2. Ensure hadoop.http.staticuser.user property is set in your core-site.xml.

However, those are not enough. The append operation may still report an error message like the following:

Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

If so, add the following properties to hdfs-site.xml:

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>false</value>
  </property>
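
Once these properties are set, one quick way to check that WebHDFS answers is to request a directory listing over its REST interface. The following Go sketch only builds such a URL. The host, path and user name are placeholders, and port 50070 is assumed to be the NameNode HTTP port (the Hadoop 2.x default); adjust them to your own configuration.

```go
package main

import (
	"fmt"
	"net/url"
)

// webHDFSURL builds a WebHDFS REST URL for the given operation.
// host is the "hostname:port" of the NameNode HTTP endpoint (an assumption
// of this sketch), path is an absolute HDFS path, and user is passed as the
// user.name query parameter.
func webHDFSURL(host, path, op, user string) string {
	q := url.Values{}
	q.Set("op", op)
	q.Set("user.name", user)
	u := url.URL{
		Scheme:   "http",
		Host:     host,
		Path:     "/webhdfs/v1" + path,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	// Placeholder values; open the printed URL with a browser or curl.
	fmt.Println(webHDFSURL("localhost:50070", "/tmp", "LISTSTATUS", "someuser"))
}
```

If WebHDFS is enabled, fetching the printed URL should return a JSON listing of the directory.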

Editing CSV Files Using Emacs

May 9, 2014

CSV (comma-separated values) files are everywhere, though sometimes fields are separated by tabs or spaces instead of commas. I see many people edit CSV files using spreadsheet software like Microsoft Excel. I edit CSV files using Emacs, so I do not have to leave my programming environment.

Emacs can recognize CSV files once you install csv-mode: http://www.emacswiki.org/emacs-fr/download/csv-mode.el. Just download it, save it anywhere, and add the following line to your ~/.emacs file:

(load-file "/path/to/csv-mode.el")

Before you can edit your CSV file, make sure its values are separated by commas. If they are not, the following simple command line can help:

cat your_file | sed 's/\t/, /g' > your_file.csv

This command converts every tab in your_file into a comma and a space. The resulting file, your_file.csv, can now be recognized by csv-mode.
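
If a field may itself contain a comma, plain substitution would produce a broken CSV file. As an alternative, here is a minimal Go sketch that does the same conversion with the standard encoding/csv package, which quotes such fields. The function name tabsToCSV is made up for this example, and unlike the sed command it writes a bare comma without the extra space.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// tabsToCSV reads tab-separated text and returns comma-separated text.
// Fields that contain a comma are quoted by the csv writer.
func tabsToCSV(in string) (string, error) {
	r := csv.NewReader(strings.NewReader(in))
	r.Comma = '\t'         // input fields are separated by tabs
	r.LazyQuotes = true    // tolerate stray quote characters in the input
	r.FieldsPerRecord = -1 // allow lines with different numbers of fields
	records, err := r.ReadAll()
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.WriteAll(records); err != nil { // WriteAll also flushes
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := tabsToCSV("name\tcity\nWang, Yi\tBeijing\n")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```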

After opening your_file.csv in Emacs, you might want to use M-x toggle-truncate-lines to disable the wrapping of long lines. Then you can use M-x csv-align-fields to align the fields. This makes the file look as it would in Microsoft Excel.

Screenshot


How to Sample a Dirichlet-Multinomial Distribution

May 8, 2014

Consider the problem of sampling from a multinomial distribution Mult(\vec{x}|\vec{p}, n), where \vec{p} is sampled from a Dirichlet prior distribution Dir(\vec{p}|\vec{\alpha}).

A conceptually straightforward solution is to sample \vec{p} from Dir(\vec{p}|\vec\alpha), and then to generate n samples from the discrete distribution defined by \vec{p}. As described by Wikipedia, sampling \vec{p}=\{p_1,\ldots,p_K\} can be done by drawing samples \{y_1,\ldots,y_K\} from K Gamma distributions, y_k \sim \Gamma(\alpha_k, 1) \text{,  } k\in[1,K], and then normalizing them: p_k = y_k/(\sum_j y_j). According to Wikipedia, if \alpha_k is a positive integer, we have \sum_{i=1}^{\alpha_k} - \log U_i \sim \Gamma(\alpha_k, 1), where U_i is a sample drawn from the uniform distribution over (0, 1]. However, if the \alpha_k's are not positive integers, sampling from Gamma becomes a complex procedure.

Even if we can implement the algorithm that draws samples from Gamma and then Dirichlet, this algorithm would not be numerically robust. Consider that when U_i is close to 0, -\log U_i becomes very large, and it is Inf if U_i is exactly 0. Another dangerous point is that if all K samples y_k happen to be 0, p_k = y_k/(\sum_j y_j) would either raise a divide-by-zero exception or make p_k NaN.

Fortunately, we can make use of the conjugacy between Dirichlet and multinomial. This conjugacy, as explained in the textbook Pattern Recognition and Machine Learning, means that \alpha_k can be interpreted as the prior number of observations of the multinomial outcome k. This leads to the following simple sampling method, which can be generalized further to sample from Dirichlet processes:

  1. \vec{p} = \vec\alpha, \vec{x} = \vec{0}, i = 0
  2. k \sim Discrete(\vec{p})
  3. p_k = p_k+1, x_k=x_k+1, i=i+1
  4. while i < n, goto 2.

Full Go code is as follows:

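import (
	"fmt"
	"math/rand"
)

// sampleDirichletMultinomial draws n outcomes with the method above and
// returns how many times each outcome k was drawn.
func sampleDirichletMultinomial(alpha []float64, n int, rng *rand.Rand) []int {
	dist := make([]float64, len(alpha)) // unnormalized weights, initially alpha
	copy(dist, alpha)
	hist := make([]int, len(alpha))
	for i := 0; i < n; i++ {
		k := sampleDiscrete(dist, rng)
		dist[k] += 1.0 // each draw counts as one more observation of k
		hist[k]++
	}
	return hist
}

// sampleDiscrete draws an index with probability proportional to dist,
// which does not need to be normalized.
func sampleDiscrete(dist []float64, rng *rand.Rand) int {
	if len(dist) <= 0 {
		panic("sample from empty distribution")
	}
	sum := 0.0
	for _, v := range dist {
		if v < 0 {
			panic(fmt.Sprintf("bad dist: %v", dist))
		}
		sum += v
	}
	u := rng.Float64() * sum
	sum = 0
	for i, v := range dist {
		sum += v
		if u < sum {
			return i
		}
	}
	panic("sampleDiscrete gets out of all possibilities")
}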

Install GDB from Source on Mac OS X

April 21, 2014

You can follow this tutorial to build GDB from source code:

  https://github.com/sirnewton01/godbg
 
But we need to apply a patch before running ./configure and make, as described at the link above:
 
  cd gdb-7.7
  # Here we need to specify the file to be patched. It is bfd/mach-o.c
  patch < ~/Download/patch
  ./configure --prefix=/Users/yiwang/usr --disable-dynamic --enable-static --enable-expact --enable-python
  make -j8
  make install

 


Extract Text from PDF Files

March 23, 2014

I got this solution from Stackoverflow.

A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:

 pdftotext \
   -f 13 \
   -l 17 \
   -layout \
   -opw supersecret \
   -upw secret \
   -eol unix \
   -nopgbrk \
   /path/to/your/pdf \
   - | less

This will display the page range 13 (first page) to 17 (last page), preserve the layout of a double-password protected named PDF file (using user and owner passwords secret and supersecret), with Unix EOL convention, but without inserting pagebreaks between PDF pages, piped through less…
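
To run the same extraction from a program, here is a minimal Go sketch that calls pdftotext through os/exec. It assumes pdftotext is installed and on the PATH; the function names are made up for this example, and the password options are left out. The main function only prints the argument list, so it runs even without pdftotext.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// pdftotextArgs builds the arguments for extracting pages first..last of
// pdf as layout-preserving text written to standard output.
func pdftotextArgs(pdf string, first, last int) []string {
	return []string{
		"-f", strconv.Itoa(first),
		"-l", strconv.Itoa(last),
		"-layout",
		"-eol", "unix",
		"-nopgbrk",
		pdf,
		"-", // "-" as the output file means standard output
	}
}

// extractText runs pdftotext and returns the extracted text.
func extractText(pdf string, first, last int) (string, error) {
	out, err := exec.Command("pdftotext", pdftotextArgs(pdf, first, last)...).Output()
	return string(out), err
}

func main() {
	fmt.Println(pdftotextArgs("/path/to/your/pdf", 13, 17))
}
```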


Removing control-Ms (^M) in Text File using Sed

March 23, 2014

When I extract text from a database or a PDF file (using xpdf’s pdftotext), I get fields or words suffixed with the special character ^M. Note that these ^M’s appear not only at the end of lines, so the g flag is needed to remove all of them. I use the following command line to remove these annoying ^M’s:

sed 's/^M//g' my-text-file

where the ^M in the above shell command line is typed by pressing Ctrl-V and then Ctrl-M.
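
The ^M character is the carriage return, written "\r" in most programming languages. When the cleanup happens inside a program rather than in the shell, the same removal can be sketched in Go as follows (the function name stripCR is made up for this example):

```go
package main

import (
	"fmt"
	"strings"
)

// stripCR removes every carriage return (^M) from s, wherever it appears.
func stripCR(s string) string {
	return strings.ReplaceAll(s, "\r", "")
}

func main() {
	fmt.Printf("%q\n", stripCR("field1\r field2\r\nnext line\r\n"))
}
```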

