Scala: first impressions

January 28th, 2011

Scala keeps coming up as a name more and more frequently, so I thought I would check it out.

What interested me is that it is apparently good for multi-core processors, making threading easier. Threading is, I feel, an upcoming challenge as CPU clock frequencies stall and the number of cores grows instead, so this sounded interesting to me.

First impressions:

  • - It is still young. Searching on Tiobe put scala somewhere around 30th to 50th place, which is pretty low :-P
  • + I like the operator overloading
  • + It does all the ‘easy’ bits of erlang fp, Haskell fp, python fp, ie the maps, filters, immutable lists. All good fun. Writing a for loop is pretty easy
  • - tooling support, ie for eclipse seems a bit hit and miss at the moment. It didn’t really work for me, running Helios
  • + it is relatively easy to mix java and scala code
  • - … but the tooling support for doing so, in eclipse doesn’t seem to me to be really there yet
  • - auto-completion in eclipse didn’t seem to work for method parameters, ie if I do ‘println(fo’ then ctrl-space it won’t offer me the foo variable name

I want to like Scala, and I do like Scala so far. It seems to me difficult to justify its use in normal everyday projects at the moment, though, specifically because the Eclipse tooling is a bit hit and miss. I may modify my point of view when I’ve played with Scala for a bit longer than the 4 hours I’ve played with it so far :-P

Decent scala tutorial: http://www.artima.com/scalazine/articles/steps.html

EasyInjector: handles cyclic dependencies

January 11th, 2011

I’ve been experimenting with Pico recently, but found that having to eliminate all cyclic constructor dependencies from my project was really slowing down development and, I feel, didn’t seem intuitive.

After all, in a normal company, the testing department can contact the development department, and the development department can contact the testing department. It is not clear that one is a client of the other, though they are both clients of some higher power.

So, I wrote EasyInjector which allows cyclic dependencies.

Singletons, Dependency Injection, Pico

December 28th, 2010

So my current project got to around a hundred classes, and singleton hell was starting to kick in. Singleton hell meaning: tricky to unit test.

In the past, I’ve either kept the singletons, and added swap_instance methods to the singleton classes for unit-testing, or created a God object that I pass around everywhere.

This time, after some googling I’m trying Pico dependency injector.

The website, picocontainer.org, is excellent. The tutorials are very easy to understand and concise, which is a rare combination.

I’ve finished migrating my project over, and so far no huge issues, except one minor one: how do you create new objects from within one of the Pico-constructed objects?

Imagine you want to make an object of class X, and you need to pass in some parameters to the constructor, and you want to do this after the initial top-level bootstrapping phase.

One could use the Pico container to make a new object, but this seems ugly to me, and introduces a dependency on Pico, which it would be nice to avoid if possible.

One could create the new object, passing in all the dependencies by hand. This sounds like a total PITA, and precisely what we are trying to avoid by using dependency injection.

My current tentative solution:

- add a ‘clone’ method to the class we want to create an instance of
- the clone method creates a new instance of itself, and passes it back
- the calling class declares an instance of this class as a dependency
- and when it wants to make a new instance of this class, it calls ‘clone’ and uses that.

Lastly, for the parameters that the calling class wants to add, I’ve moved those to a separate ‘init’ method.

The final result for the calling class:

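class MyClass {
   final SomeOtherClass someOtherClassTemplate;

   public MyClass(SomeOtherClass someOtherClassTemplate){
       this.someOtherClassTemplate = someOtherClassTemplate;
   }

   // doSomething is just a placeholder name for whichever method needs the new object
   void doSomething() {
       // ...
       // clone the injected template, then pass in the call-specific parameters via init
       SomeOtherClass someOtherClass = someOtherClassTemplate.clone().init("foo","blah",123);
       // ...
   }
}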

Not beautiful, not perfect, but works for now, and seems better than the other alternatives I thought of or read about so far. I imagine there is a standard solution, and will likely find out about it sometime.

As for dependency injection itself, so far so good. Time will tell.

Idea: compile-time permissions on java classes

December 20th, 2010

Problem:

- I want to be able to control access to my classes, which generally means making most data read-only, through get accessors
- I want to be able to serialize them, which means either adding all the set accessors too, or using reflection

One solution is to pass in a data bean just used for serialization into a class, which populates the data bean, and returns it to the serialization code. I don’t like this much, because it is extra code: more work, more maintenance, more debugging.

Using reflection works on this occasion. However, in the general case, it seems to me to be a weakness in the language to need to use reflection as a workaround.

I’m probably being naive here, and there’s probably a bunch of stuff in the literature about this already, but I can’t help thinking that it could be useful to be able to assign permissions to java classes, that are checked at compile time.

Something like:

@Serialization // add arbitrary annotation to signify group for permissions
class MySerializer {
}

class MyClass {
    private int value;

    @AllowedCallers(Serialization.class)  // assign permissions, only allow classes with @Serialization annotation to use this method
    public void setSomething(int value){
        this.value = value;
    }
}

Array iteration optimization in Java

December 17th, 2010

The problem

I am working on a project right now which involves lots of loops that look like:

			for( int x = 0; x < size; x++ ) {
				for( int y = 0; y < size; y++ ) {
					for( int z = 0; z < height; z++ ) {
						if( somearray[x][y][z] == somevalue) {
							// do something
						}
					}
				}
			}

Writing out these loops by hand is tedious, error-prone and high-maintenance, and makes the code longer.

If I was writing in C++, I could macro it out.

In Java, for reasons I tend to agree with, there are no macros, so that is not an option. Same deal in C#, as far as I know.

Idea One: anonymous classes

My first idea was to write a class which would use an anonymous class to call back into my code. I figured one could use it like this:

new Iterator(startVector3i, endVector3i).iterate( new Callback(){
   public void callback(Vector3i next ){
      // do stuff here
   }
});

Technically this is possible, but the code inside can only read local variables of the calling method that are declared final, and cannot assign to them, which makes it not terribly useful I feel.

Discovery: using an iterator class surprisingly slow

Next, I made an iterator class that worked like this:

ArrayIterator iterator = new ArrayIterator(startVector3i,endVector3i);
while( iterator.next() ){
    if( somearray[iterator.x][iterator.y][iterator.z] == somevalue ) {
        // do something
    }
}

This took a long time to execute. Compared to the bog-standard nested for-loops we started with, it took 20 times longer to execute!

Why?

Deduction: java optimizes processor cache requests in nested for loops

It baffled me why this approach was so slow. Surely a method call is not so expensive?

I tried all sorts of different performance tests, and in the end found the following very specific code sample:

			for( int x = 0; x < size; x++ ) {
				for( int y = 0; y < size; y++ ) {
					for( int z = 0; z < height; z++ ) {
						if( somearray[x][y][z] == somevalue) {
							// do something
						}
					}
				}
			}

This runs quickly.

			for( int x = 0; x < size; x++ ) {
				for( int y = 0; y < size; y++ ) {
					for( int z = 0; z < height; z++ ) {
						if( wrapperObject.getArrayValueAt(x,y,z) == somevalue) {
							// do something
						}
					}
				}
			}

This runs 20 times slower, even though I found that ordinary simple method calls were not particularly expensive.

My conclusion is that when we iterate over an array with an obvious in-lined for loop, java/jvm/processor has enough information available to realize it can optimize by fetching a batch of values from the array all at once.

When the array access is not directly inside the nested for loop, this doesn’t work.

So, my conclusion is that it seems any iterative access to large arrays needs to be explicitly inlined with the iterating loop in the code.

Edit: found a relevant sun wiki page that I think explains this effect

http://wikis.sun.com/display/HotSpotInternals/RangeCheckElimination

Basically, a range check is carried out on every array access. When the access happens directly in the body of a for loop, many of these checks can be eliminated; when the access is hidden behind a method call that does not get inlined into the loop body, they may not be.

My track-record so far for number of words of mandarin

October 14th, 2010

Results so far of estimating the number of 2-character words I know in mandarin, ie can give an appropriate definition for each word. See Estimating number of words known in mandarin (or any language really).

Oct 9: 2689
Oct 11: 1758
Oct 12: 2638
Oct 14: 3224

It’s bouncing around all over the place, so I guess with such a small sample (50), it’s not terribly accurate. Not even to 10%. More like accurate to +/- 1000 it seems.

The Oct 14 data used 100 words, rather than 50, to try to reduce the instability somewhat, though it still looks like a bit of an outlier.
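
A rough way to see why 50 or even 100 words is too few: treat each sampled word as an independent known/unknown trial (a binomial model, which is an assumption rather than anything measured here) and ask how big the sample must be for one standard error to shrink to a given fraction of the estimate. The function name and the 18% figure (roughly 2690 out of 14790) are only illustrative.

```python
import math

def sample_size_needed(p, relative_error):
    """Sample size at which one standard error equals relative_error * p.

    p is the assumed fraction of dictionary words known (hypothetical input).
    Uses the binomial standard error sqrt(p * (1 - p) / n).
    """
    target_se = relative_error * p
    return math.ceil(p * (1 - p) / target_se ** 2)

# Knowing about 18% of the words and wanting +/- 10% of the estimate:
print(sample_size_needed(0.18, 0.10))
```

Under these assumptions the answer comes out at several hundred words per test, which fits the picture of 50-word samples bouncing around by a thousand or so.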

Electronic hand-held dictionaries can read handwriting now! Great for learning Chinese.

October 9th, 2010

I’m not going to name brands or models; this is not an advert! They are really great though.

When I first tried to read Chinese five years ago, using a dictionary was tedious in the extreme. It took about 3 minutes to look up a single character, by which time the flow of the text had long gone.

Here’s what we used to have to do:
Look up the key:
- count the number of strokes in the character’s key
- the key is the bit of the character on the left
- sometimes the key isn’t really obvious
- turn to the index of keys, and turn to the page with keys having the number of strokes of our character
- then read through each key on that page to find our own key
- this might take a minute or two, up to 5 minutes if we picked the wrong key, or miscounted the strokes

Look up the character:
- the key index will give us the page number of the character index containing characters with our key
- at this point, we need to count the strokes in the whole character. Typically there are about 8ish. Sometimes the number of strokes isn’t really clear: is that line two strokes, or just one?
- now we turn to the section corresponding to our key and the number of strokes in our character
- … and hunt for our character
-> this gives us the pinyin for our character. Pant. Huff. But we’re not there yet…

Now, finally, armed with the pinyin, we look up the pinyin in the main bulk of the dictionary, and look up the actual meaning of our character…

… by which time we’ve long forgotten what was in the text, not to mention the five minutes this takes is a total waste of time, teaches us nothing, is just gone from our life permanently, uselessly.

Here’s how it works on the electronic dictionaries:
- turn it on. One button. 1 second
- take out the stylus. 1 second
- pick a dictionary, change to ‘handwriting mode’. 2 seconds?
- draw the character. 5 seconds
- see the meaning. That was easy…

Sometimes, we need to draw the character a couple of times before we draw it acceptably for the computer. It’s not wasted time though, since we learn to draw the character in the process.

I’m very happy with mine. Takes up a lot of space in my pocket, along with my phone, so my trousers bulge strangely. Other than that I’m very happy with it.

Estimating number of words known in mandarin (or any language really)

October 9th, 2010

I wanted to be able to measure my progress in mandarin. How many words do I know? How many words did I learn in the last week? Am I ever going to be able to learn enough words to communicate reasonably?

I could guess, or just keep going until magically I could speak, but I wanted something a little more objective.

I decided to take a stardict dictionary, run it through a script, and get the script to pump out 50 random 2-character words for me. Then I just go down each word and see which ones I know. Add up the number I get right, divide by 50 and multiply by the total number of 2-character words in the dictionary, and I get an approximation of how many Chinese 2-character words I know in total.

I can also do the same thing for single characters, only this time I write down the pinyin, instead of the meaning, and match the pinyin either ensuring the tone is the same, or not bothering, and thus obtaining two estimates for characters, with or without tones.

The results were a little depressing for me:
2-character words: 2690, out of 14790
characters, correct tone: 890, out of 6950
characters, ignoring tone: 1390, out of 6950

Seems I have a way to go… but on the other hand it’s not like I’m orders of magnitude away from learning the dictionary either… and I now have a way to sort of measure my progress, to within some margin of error, which I haven’t calculated yet, but which I’m going to guess is around 5-10%?
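
The arithmetic above, plus a rough error bar, can be sketched in a few lines of Python 3. This is only an illustration: the function name is made up, the 9-out-of-50 score is a hypothetical example, and the error bar assumes each sampled word is an independent known/unknown trial (a binomial model), which is an added assumption.

```python
import math

def estimate_known_words(correct, sample_size, dictionary_size):
    """Return (estimate, one-standard-error spread) of words known."""
    p = correct / sample_size
    estimate = p * dictionary_size
    # Binomial standard error of the sampled fraction, scaled to the dictionary.
    std_error = math.sqrt(p * (1 - p) / sample_size) * dictionary_size
    return estimate, std_error

# Hypothetical score: 9 right out of 50, against 14790 two-character words.
estimate, std_error = estimate_known_words(9, 50, 14790)
print("about %d words, give or take %d" % (round(estimate), round(std_error)))
```

For that hypothetical score the spread comes out at roughly 800 words on an estimate of about 2660, so under these assumptions the margin is nearer 30% than 5-10%.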

The script I used:
- first, we take a stardict dictionary
– this has an .ifo, .idx and .dict.(something) file
– copy the .dict.(something) file to .dict.gz, then gunzip it
– edit the .ifo file to change sametypesequence to just ‘m’, instead of ‘ym’, if it isn’t already. I don’t know why we have to do this, but if we don’t the next script fails
- ‘sudo apt-get install stardict-tools’, if you didn’t already
- /usr/lib/stardict-tools/stardict2txt (ifo filename)
-> this gives you a text file with the dictionary contents in, one entry per line
- now we can use the following script to read the lines from this text file, and pump out 50 2-character words, or whatever you want

#!/usr/bin/python
#
# Copyright Hugh Perkins 2010
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
#  more details.
#
# You should have received a copy of the GNU General Public License along
# with this program in the file licence.txt; if not, write to the
# Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
# MA 02111-1307 USA
# You can find the licence also on the web at:
# http://www.opensource.org/licenses/gpl-license.php
#

import sys
import os
import random

dicfile = sys.argv[1]
print "dicfile: " + dicfile
dicfile = open( dicfile )
lines = dicfile.readlines()
dicfile.close()

numlines = len(lines)
print "number lines: " + str(numlines)

print "how many characters per word?"
numchars = int(sys.stdin.readline().strip().lower())
print "num chars per word: " + str(numchars)

totalwords = 0
for thislineindex in xrange(0,numlines):
   thisline = lines[thislineindex]
   chineseword = thisline.split("\t")[0]
   chinesewordutf8 = chineseword.decode("utf-8")
   if len(chinesewordutf8) == numchars and not "(place in " in thisline and not "(city in " in thisline and not " County, in " in thisline and not "(county in " in thisline:
      totalwords = totalwords + 1

print "total available words of " + str(numchars) + " characters: " + str(totalwords)

print "how many words to pick?"

numwordstoget = int(sys.stdin.readline().strip().lower())
print "getting " + str(numwordstoget) + " words"

for i in xrange(1,numwordstoget+1):
   gotword = False
   while not gotword:
      thislineindex = random.randint(0,numlines - 1)  # randint includes both ends, so stop at the last valid index
      thisline = lines[thislineindex]
      chineseword = thisline.split("\t")[0]
      chinesewordutf8 = chineseword.decode("utf-8")
      if len(chinesewordutf8) == numchars and not "(place in " in thisline and not "(city in " in thisline and not " County, in " in thisline and not "(county in " in thisline:
         print str(i) + " " + thisline
         print str(i) + " " + chineseword
         gotword = True

To use it:
python wordcount.py (path to dictionary text file)
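
One limitation of the script is that its picking loop can print the same word more than once, which slightly skews a 50-word sample. A hedged Python 3 sketch of an alternative (the script itself is Python 2) is to collect the eligible entries first and draw from them without replacement using random.sample. The function names and the PLACE_MARKERS constant are hypothetical; the filter strings are the ones the script already uses.

```python
import random

# Substrings the script uses to skip place-name entries.
PLACE_MARKERS = ("(place in ", "(city in ", " County, in ", "(county in ")

def eligible_words(lines, numchars):
    """Entries whose headword has numchars characters and is not a place name."""
    words = []
    for line in lines:
        headword = line.split("\t")[0]
        if len(headword) == numchars and not any(m in line for m in PLACE_MARKERS):
            words.append(line.rstrip("\n"))
    return words

def pick_words(lines, numchars, count, rng=random):
    """Pick up to count distinct eligible entries at random."""
    candidates = eligible_words(lines, numchars)
    return rng.sample(candidates, min(count, len(candidates)))
```

In Python 3 the lines are already text, so len(headword) counts characters directly and the decode("utf-8") step is not needed, provided the file is opened with encoding="utf-8".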

What can be improved in Windows 7? Add app store

August 20th, 2010

I bought a pentium su4100 based netbook recently. It comes with Windows 7, and I must say it is a refreshing change to just be able to run games without spending 3 days full-time trying to tweak the wine configuration to try to get something passably working…

And Windows 7 works generally rather well I feel. Hasn’t crashed on me once yet. Microsoft produces a free real-time virus and malware scanner now.

So, what is there left to improve in Windows 7 for future versions of Windows? Off-the-top-of-my-head, I came up with two things:
- make other things become built-in, available at no extra cost, eg Office
- add an application store, a one-stop shop to buy everything, including everything currently available from Steam, battle.net, and so on
– no more need to wonder whether something one is downloading is legitimate or not
– downloads would be easy and secure by default, ie make sure they use https and so on
– could provide an option to either buy software outright, or lease it for an hour, 24 hours, a month, or whatever
– or whatever is done for applications in the iPhone app store, which seems to be very successful

I suppose that if everything were made available through an appstore, then that would be a great whitelist for 99% of end-users, and all other applications could be entirely banned from running, eliminating a lot of viruses, rootkits and so on.

Converting CantoFish dictionary for stardict

April 11th, 2010

Stardict is an awesome dictionary utility for linux, with several dictionaries available for mandarin.

I started to learn Cantonese recently, and I couldn’t find any stardict dictionaries for Cantonese.

There is a plugin for Firefox called Cantofish which uses the dictionary data from CantoDict, and adsotrans.

Unfortunately, for some reason CantoFish doesn’t run on my machine.

And also, stardict’s mouse-over translation is really awesome.

So I took a look at converting Cantofish’s dictionary into stardict format, which turned out to be pretty easy.

Cantofish’s dictionary is in the firefox profile, in extensions/cantofish@cantofish.net/chrome/content/canto.dat

As far as I know, this data is available under the GPL, or possibly under a non-commercial attribution license (eg adso).

canto.dat is a tab separated text file, which is really easy to read. I was pleasantly surprised by this!

Then, stardict dictionaries can be created using tabfile, which is in stardict-tools, from an appropriate tab-separated file.

I used the following script to convert canto.dat into an input file for tabfile, and then the rest is easy:

#!/usr/bin/python

import sys
import os
import string

def go( cantopath, outpath ):
	print cantopath
	cantofd = open( cantopath, "r" )
	outfd = open( outpath, "w")
	firstline = True
	for line in cantofd.readlines():
		if not firstline:
			line = line.strip()
			#print line
			cantocharacters = line.split(" ")[0]
			#print cantocharacters
			cantopronunciation = line.split("[")[1].split("]")[0].strip()
			#print cantopronunciation
			trans = string.join( line.split("]")[2:],' ').replace('/', '\n').strip().replace('\n', '\\n')
			#print trans
			outfd.write( cantocharacters + '\t' + cantopronunciation + '\\n' + trans + '\n' )
		firstline = False
	outfd.close()
	cantofd.close()

go( sys.argv[1], sys.argv[2] )

The script expects the path of canto.dat as the first argument, and the name of the output file as the second.
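
To make the per-line transformation concrete, here is the same splitting logic as a standalone Python 3 function applied to one line. The sample line is invented for illustration: its layout (characters, then two bracketed readings, then slash-separated meanings) is only inferred from how the script splits each line, and may not match real canto.dat entries exactly.

```python
def convert_line(line):
    """Turn one dictionary line into a 'word<TAB>definition' line for tabfile."""
    line = line.strip()
    characters = line.split(" ")[0]
    # Text inside the first pair of square brackets.
    pronunciation = line.split("[")[1].split("]")[0].strip()
    # Everything after the second ']' is the translation; '/' separates meanings.
    trans = " ".join(line.split("]")[2:]).replace("/", "\n").strip().replace("\n", "\\n")
    # tabfile input uses a literal backslash-n for line breaks inside a definition.
    return characters + "\t" + pronunciation + "\\n" + trans

# Hypothetical input line, not taken from the real data file.
print(convert_line("你好 [nei5 hou2] [ni3 hao3] /hello/hi/"))
```

Each output line has the headword, a tab, then the pronunciation and meanings joined by literal \n sequences.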

Then you can just process the output using tabfile, and copy the resulting files into an appropriate subfolder of /usr/share/stardict/dic.