

sorting Hungarian text

Sort programs put lines in alphabetical order (or any other order you manage to tell them to use). However, they typically base the ranking on character codes, which causes two problems: A, B ... precede a, b ..., and z precedes á, é ....
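To see both effects, here is a small sketch that forces plain byte-value ordering with LC_ALL=C. The sample words are made up for illustration.

```shell
# Force byte-value ordering: capitals come first, accented letters last.
printf '%s\n' zebra apple Banana álom | LC_ALL=C sort
# prints, one word per line: Banana apple zebra álom
```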

The first problem is usually overcome within the sort program itself: for example, sort -f in Unix ignores case. For the second, you can find out how to set the locale on your computer to get accented characters into the right place. Once your operating system knows that you want to use the Hungarian conventions, á will be ordered between a and b.
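A rough illustration of both remedies follows. The locale name hu_HU.UTF-8 is an assumption: names vary between systems, and the locale may not be installed on yours, so check first.

```shell
# Case folding: compare plain sort with sort -f under byte ordering.
printf '%s\n' cherry Banana apple | LC_ALL=C sort      # Banana, apple, cherry
printf '%s\n' cherry Banana apple | LC_ALL=C sort -f   # apple, Banana, cherry

# List any installed Hungarian locales (the name pattern is an assumption).
locale -a 2>/dev/null | grep -i '^hu_' || echo 'no Hungarian locale found'

# If one is listed, something like this should place accented letters
# according to the Hungarian conventions:
#   LC_ALL=hu_HU.UTF-8 sort myfile
```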

What justifies the Perl scripts provided here, however, is that the conventions of some languages, Hungarian among them, require that some multi-letter graphemes be ordered elsewhere than expected. In Hungarian these are cs, dz, dzs, gy, ly, ny, sz, ty and zs. Each counts as a single letter that follows its first component in the alphabet (cs comes after c, and so on), that is:

cucc < csap, gzip < gyík, zűr < zsír...

Unfortunately, the task is not trivial: some sequences that look like multi-letter graphemes are in fact not, e.g., bércsík may be ranked before or after bérczerge depending on its morphology: bér+csík (after bérczerge) or bérc+sík (before bérczerge). This can be decided only with a morphological/semantic parser, which is probably not worth doing because the problem practically never turns up.
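The grapheme ordering itself can be sketched in a few lines of shell and awk. This is only an illustration, not the p2abc/abc2p scripts: husort_sketch is a hypothetical name, the input is assumed to be UTF-8, only ASCII capitals are folded, and every cs, zs, etc. is greedily treated as one grapheme, so the bércsík ambiguity is ignored, as discussed above.

```shell
# husort_sketch: hypothetical helper, not the p2abc/abc2p scripts.
# Reads UTF-8 lines on stdin, writes them in Hungarian grapheme order.
husort_sketch() {
  LC_ALL=C awk '
    BEGIN {
      # Hungarian alphabet; the position of each grapheme is its rank.
      n = split("a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs", abc, " ")
      for (i = 1; i <= n; i++) rank[abc[i]] = sprintf("%02d", i)
    }
    {
      word = tolower($0); key = ""; pos = 1
      while (pos <= length(word)) {
        # Try the longest candidate first (3 bytes covers "dzs"; 2 bytes
        # covers the digraphs and the UTF-8 accented vowels).
        for (len = 3; len >= 1; len--) {
          g = substr(word, pos, len)
          if (g in rank) break
        }
        if (g in rank) { key = key rank[g]; pos += length(g) }
        else           { key = key "00";    pos += 1 }   # unknown byte
      }
      # Prefix each line with its numeric key; the key is cut off after sorting.
      print key "\t" $0
    }' | LC_ALL=C sort | cut -f2-
}
```

For example, printf '%s\n' csap cucc gyík gzip zsír zűr | husort_sketch prints cucc, csap, gzip, gyík, zűr, zsír, one per line, matching the ordering above.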

What you do then is the following:

  1. convert your text to Prószéky encoding
  2. convert it with p2abc
  3. sort it, folding case and ignoring non-letters (sort -fd in Unix)
  4. convert it back to Prószéky encoding with abc2p
  5. convert it back to your own encoding
This looks like a lot of typing, but you can alias the whole process. For example, you may put the following line in your shell's config file:

alias husort='p2abc | sort -fd | abc2p'

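[Note that this does not work like sort: husort myfile does not do what you want, because the alias expands to p2abc | sort -fd | abc2p myfile, so the file name ends up as an argument of abc2p. Use cat myfile | husort instead; you could write a fancy shell script to make it accept file names like sort does, though.]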
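One possible version of such a wrapper is a shell function in place of the alias. This is a minimal sketch that assumes p2abc and abc2p are on your PATH and read standard input, as the alias above implies.

```shell
# husort: function replacement for the alias above.
# Assumes p2abc and abc2p (the scripts from this page) are on your PATH.
unalias husort 2>/dev/null || true   # drop any alias of the same name first
husort() {
  # cat passes the named files through, or stdin when no files are given
  cat -- "$@" | p2abc | sort -fd | abc2p
}
```

With this in place, both husort myfile and cat myfile | husort should work.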

If your files are typically ISO-8859 encoded, extend the alias as

alias husort='iso2p | p2abc | sort -fd | abc2p | p2iso'

etc.



© Péter Szigetvári, page last touched Sun Jan 27 00:01:25 CET 2002