Category Archives: Miscellaneous

Do not trust thee RAID alone

I’m assuming that most readers are familiar with what RAID, the “Redundant Array of Inexpensive Disks”, is. Using RAID for disk redundancy has been around for a long time, apparently first mentioned in 1987 at the University of California, Berkeley (see also: The Story So Far: The History of RAID). I’m honestly not sure why they chose the term “inexpensive” back in 1987 (I suppose “RAD” isn’t as catchy of a name), but regardless of the wording, a RAID is a fairly easy way to protect yourself against hard drive failure. Presumably, any production server will have a RAID these days, especially with hard drives being as inexpensive as they are today (unless you purchase them list price from major hardware vendors, that is). Another reason why RAID is popular, is of course the fact that hard drives are probably the most common component to break in a computer. You can’t really blame them either, they do have to spin an awful lot.


Lesson #1: Don’t neglect your backups because you are using RAID arrays
That being said, we recently had an unpleasant and unexpected issue in our office with a self-built server. While it is a production server, it is not a very critical one, and as such a down-time of 1-2 days with a machine like that is acceptable (albeit not necessarily desired). Unlike the majority of our “brand-name” servers, which are under active support contracts, this machine was using standard PC components (it’s one of our older machines), including an onboard RAID that we utilized for both the OS drive as well as the data drive (it has four disks, both in a RAID 1 mirror). Naturally, the machine is monitored through EventSentry.

Well, one gray night it happened – one of the hard drives failed and a bunch of events (see for an example) were logged to the event log, and immediately emailed to us. After disappointingly reviewing the emails, the anticipated procedure was straightforward:

1) Obtain replacement hard drive
2) Shut down server
3) Replace failed hard drive
4) Boot server
5) Watch RAID rebuilding while sipping caffeinated beverage

The first 2 steps went smoothly, but that’s unfortunately how far our IT team got. The first challenge was to identify the failed hard drive. Since they weren’t in a hot-swappable enclosure, and the events didn’t indicate which drive had failed, we chose to go the safe route and test each one of them with the vendors supplied hard drive test utility. I say safe, because it’s possible that a failed hard drive might work again for a short period of time after a reboot, so without testing the drives you could potentially hook the wrong drive up. So, it’s usually a good idea to spend a little bit of extra time in that case, to determine which one the culprit is.

Eventually, the failed hard drive was identified, replaced with the new (exact and identical) drive, connected, and booted again. Now normally, when connecting an empty hard drive, the raid controller initiates a rebuild, and all is well. In this case however, the built-in NVidia RAID controller would not recognize the RAID array anymore. Instead, it congratulates us on having installed two new disks. Ugh. Apparently, the RAID was no more – it was gone -  pretty much any IT guys nightmare.

No matter what we tried, including different combinations, re-creating the original setup with the failed disks, trying the mirrored drive by itself, the RAID was simply a goner. I can’t retell all the things that were tried, but we ultimately had to re-create the RAID (resulting in an empty drive), and restore from backup.

We never did find out why the RAID 1 mirror that was originally setup was not recognized anymore, and we suspect that a bug in the controller firmware caused the RAID configuration to be lost. But regardless of what was ultimately the cause, it shows that even entire RAID arrays may fail. Don’t relax your backup policy just because you have a RAID configured on a server.

Lesson #2: Use highly reliable RAID levels, or configure a hot spare
Now I’ll admit, the majority of you are running your production servers on brand-name machines, probably with a RAID1 or RAID5, presumably under maintenance contracts that ship replacement drives within 24 hours or less. And while that does sound good and give you comfort, it might actually not be enough for critical machines.

Once a drive in a RAID5 or RAID1 fails, the RAID array is in a degraded state and you’re starting to walk on very thin ice. At this point, of course, any further disk failure will require a restore from backup. And that’s usually not something you want.

So how could a RAID 5 not be sufficiently safe? Please, please: Let me explain.

Remember that the RAID array won’t be fully fault tolerant until the RAID array is rebuilt – which might be many hours AFTER you plug in the repaired disk depending on the size, speed and so forth. And it is during the rebuild period that the functional disks will have to work harder than usual, since the parity or mirror will have to be re-created from scratch, based on the existing data.

Is a subsequent disk failure really likely though? It’s already pretty unlikely a disk fails in the first place – I mean disks don’t usually fail every other week. It is however much more likely than you’d think, somewhat depending on whether the disks are related to each other. What I mean with related, is whether they come from the same batch. If there was a problem in the production process – resulting in a faulty batch – then it’s actually quite likely that another bites the dust sooner rather than later. It happened to a lot of people – trust me.

But even if the disks are not related, they probably still have the same age and wear and, as such, are likely to fail in a similar time frame. And, like mentioned before, the RAID array rebuild process will put a lot of strain on the existing disks. If any disk is already on its last leg, then a failure will be that much more likely during the RAID array rebuild process.

raid6.pngRAID 6, if supported by your controller, is usually preferable to a RAID5, as it includes two parity blocks, allowing up to two drives to fail. RAID 10 is also a better option with potentially better performances, as it too continues to operate even when two disks fail (as long as it’s not the disks that are mirrored). You can also add a hot spare disk, which is a stand-by disk that will replace the failed disk immediately.

If you’re not 100% familiar with the difference between RAID 0, 1, 5, 6, 10 etc. then you should check out this Wikipedia article: It outlines all RAID levels pretty well.

Of course, a RAID level that provides higher availability is usually less efficient in regards to storage. As such, a common counterargument against using a more reliable RAID level is the additional cost associated with it. But when designing your next RAID, ask yourself whether the savings of an additional hard drive is worth the additional risk, and the potential of having to restore from a backup. I’m pretty sure that in most cases, it’s not.

Lesson #3: Ensure you receive notifications when a RAID array is degraded
Being in the monitoring business, I need to bring up another extremely
important point: Do you know when a drive has failed? It doesn’t help much to have a RAID when you don’t know when one or more drives have failed.

Most server
management software can notify you via email, SNMP and such – assuming
it’s configured. Since critical events like this almost always trigger
event log alerts as well though, a monitoring solution like EventSentry can simplify the notification process.
Since EventSentry monitors event logs, syslog as well as SNMP traps, you can take a uniform approach to notifications. EventSentry can notify you of RAID failures regardless of the hardware vendor you
use – you just need to make sure the controller logs the error to the
event log.

Lesson #4+5: Test Backups, and store backups off-site
Of course one can’t discuss reliability and backups without preaching the usual. Test your backups, and store (at least the most critical ones) off-site.

Yes, testing backups is a pain, and quite often it’s difficult as well and requires a substantial time commitment. Is testing backups overkill, something only pessimistic paranoids do? I’m not sure. But we learned our lessen the hard way when all of our 2008 backups were essentially incomplete, due to a missing command-line switch that recorded (or in our case did not) the system state. We discovered this after, well, we could NOT restore a server from a backup. Trust me: Having to restore a failed server and having only an incomplete, out-of-date or broken backup, is not a situation you want to find yourself in.

My last recommendation is off-site storage. Yes, you have a sprinkler system, building security and feel comfortably safe. But look at the picture on top. Are you prepared for that? If not, then you should probably look into off-site backups.

So, let me recap:

1. Don’t neglect your backups because you are using RAID arrays.
2. Use highly reliable RAID levels, or configure a hot spare.
3. Ensure you receive notifications when a RAID array is degraded
4. Test your backups regularly, but at the very least test them once to ensure they work.
5. Store your backups, or at least the most critical, off-site.

Stay redundant,

Continue reading

Curiosity Kills the Cat

25 years ago, on July 24th 1985, the Amiga 1000 was introduced in New York City (check out the ad). Coincidentally, the Amiga 500 was my first computer and I loved playing games on the Rock Lobster – despite the 7.15909 MHz processor. Well, those were the good old days, the days before mainstream email, the days before spam. Or were they? Believe it or not, in 1985 it had already been 7 years since the first spam email was sent by Gary Thuerk over the ARPAnet.


I don’t know about you, but 32 years later I still get spam delivered to my inbox on a daily basis, and that’s despite having 2-3 spam filters in place. What’s more, I still get legitimate email caught by the spam filter, mostly to the dismay of the sender.

Now, of course WE all know not to open spam – or to even look at it – as it will potentially confirm receipt (if you display images from non-trusted sources) and could also trigger malware (again depending on your email reader’s configuration).

But, we’ve all seen spam emails and I can’t help but wonder who actually reads these emails (for purposes other than to get a chuckle), much less opens them! Let’s not even think about who opens attachments or clicks links (yikes!) from spam emails.


The Facts

So WHO are those people opening, clicking spam? Well, turns out that the MAWWG, the Messaging Anti-Abuse Working Group determines exactly that (and presumably other things too) – every year. Better yet, they publish that information for our enjoyment.

It’s been a few months since the latest findings were published, but I’d consider them relevant today nevertheless (and a year from now for that matter).

In a nutshell, the group surveyed the behavior of consumers both in North America and Europe, and published key findings in regards to awareness, consumer confidence and so forth.

Before I give the link to the full PDF (see the Resources section below); here are what I think are some of the most interesting facts:

  • Half of all users in North America and Europe have “confessed” to opening or accessing spam. 46% of those who opened spam, did so intentionally to unsubscribe or out of some untameable sense of curiosity. Some were even interested in the products “advertised” to them!

    Bottom Line: 1 out of 4 people open spam emails because they want to know more, or want to unsubscribe.

  • In more detail, 19% of all users surveyed either clicked on a link from an email (11%) or opened an attachment from an email (8%) that they themselves suspected to be spam. I found that to be one of the most revealing numbers in the report.
  • Young users (under 35) consider themselves more experienced, yet at the same time engage in more risky behavior than other age groups. In Germany, 33% of all users consider themselves to be experts. Compare that to France, where only 8% of all users think they are pros.
  • Less than half of users think that stopping spam or viruses is their responsibility. Instead, they feel that the responsibility lies mainly with the ISP and A/V companies. 48% of all respondents do realize that it is their responsibility. The report doesn’t state whether this particular question, which lists 10 choices, was a multiple choice question.
  • When asked about bots, 84% of users were familiar with the possibility that software, say a virus, can control their computer. At the same time, only 47% were familiar with the terms “bot” or “botnet”.
  • On the upside, 94% of all users are running A/V software that is up-to-date, which is a comforting fact. I can only imagine that the remaining 6%, given Apple’s market share, account for most of the rest.

    My opinion: OS X users are probably still oblivious and don’t see the need to install A/V or any other type of security software on their computers. Still, some PC users apparently still don’t install AntiVirus/AntiMalware on their computers, despite many free options being available today.

Wow, that’s a lot of bad news to digest. So if I may summarize – the reason why we keep getting spam in our inboxes, is because every 5th person with a computer clicks on links or opens attachments (ah!) from spam emails, and because 6% of all users with a computer don’t run security software. Given the amount of people that dwell in the western hemisphere, that amounts to a lot of people.

Well, at least I know now why I keep getting those nuisance emails in my inbox. But somehow I don’t feel any better about them.

Training Day

I think what this report shows us the importance of user education. While people are apparently aware of spam, it doesn’t look like the average Joe is aware of the implications that a simple click in an email can have.

If you are reading this email, then you are probably a network professional working in an organization. With that, you have a unique opportunity to organize a simple workshop with your employees to educate them about the potential threats, and remind them that it’s not a good idea to do anything with suspect emails.


There is a wealth of information available on the web about educating users on spam and general computer security. We all know that software can only do so much – it’s a constant cat & mouse game between the researchers and the bad guys. It’s simply not possible, at least not today, to make the computers we use on a daily basis 100% secure.

While securing computers in a corporation is possible to some extent using whitelisting, content filters and such, doing the same thing for home computers is much more difficult. And it’s those computers that are most likely to be part of a botnet.

I can only imagine that the average user does not know that botnets can span thousands, if not millions, of computers. The Conficker botnet alone infected around 10 million computers and has the capacity to send 10 billion emails per day.

Let’s face it, the situation will not improve as long as people will click links in emails and open attachments from suspicious senders.

I encourage you to organize a training session with your users on a regular basis. If your organization is large, then you might want to start with the key employees first, and maybe create a tiered training structure.

Our Network is Safe

You might think that your network is safe. You have AntiVirus, white listing, AntiMalware, firewalls in every corner, web content filters and more. Scheduling a training sessions to tell your users on not to do the obvious, is probably the last thing on your mind.

But read on.

Risky behavior by your end users will not only affect global spam rates, but your organization as well. Corporate espionage is growing, and spies (whether they are from a foreign government or corporation) often use email to initially get access to an individuals computers. See SANS Corporate Espionage 201 (PDF) for some techniques being employed.

For example, pretty much every organization has people working from home. If a malicious attacker can compromise a home computer that is used to access a corporate network (even if it’s just used to access emails) and install a key logger, then they will most likely have gotten access to your corporate network. Once they have their foot in the door, it’s only a matter of time.

There are plenty of resources available on the net on how to educate users on security, spam and so forth. A short training session of 20 minutes is probably enough. The message to convey is simple, and if you keep a few points in mind the session can even be fun. Consider the following for the training session:

  • Be sure to interact with your users. Start off by asking them if they use A/V software or AntiMalware software at home.
  • Tell them about botnets, and if they would be happy knowing that their computer is part of a 10 million botnet controlled by people in the Ukraine.
  • Be sure to explain that a single users actions can compromise their corporate network.
  • Explain that technology cannot provide 100% security against intruders.

Of course, user education alone is not the answer to solving security problems like viruses, phishing and the like. Encryption, digital signatures (especially for corporate emails), white-listing all should be employed regardless of user education.


2010 MAAWG Consumer Survey Key Findings Report (6 pages)
2010 MAAWG Consumer Survey Full Report (87 pages)

Using Cartoons to Teach Internet Security
Get IT Done: IT pros offer tips for teaching users

UNICODE – ONE code to rule them all

If you live in an English-speaking country like the United States, United Kingdom or Australia, then you are in the lucky position where every character in your language can be represented by the ASCII table. Many other languages aren’t as lucky unfortunately, and it is no surprise given the fact that over 1000 written languages exist. Most of these languages cannot be interpreted by ASCII, most notably Asian and Arabic languages.

Take the text below for example, ASCII would be struggling with this a bit (to say the least):


Understanding UNICODE is no easy feat however – just the mere abbreviations out there can be mind-boggling: UTF-7, 8, 16, 32, UCS-2, BOM, BMP, code points, Big-Endian, Little-Endian and so forth. UNICODE support is particularly interesting when dealing with different platforms, such as Windows, Unix and OS X.

It’s not all that bad though, and once the dust settles it can all make sense. No, really. As such, the purpose of this article is to give you a basic understanding of UNICODE, enough so that the mention of the word UNICODE doesn’t give you cold shivers down your back.

Unicode is essentially one large character set that includes all characters of written languages, including special characters like symbols and so forth. The goal – and this goal is reality today – is to have one character set for all languages.

Back in 1963, when the first draft of ASCII was published, Internationalization was probably not on the top of the committee member’s minds. Understandable, considering that not too many people were using computers back then. Things have changed since then, as computers are turning up in pretty much every electrical device (maybe with the exception of stoves and blenders).

The easiest way to start is, of course, with ASCII (American Standard Code for Information Interchange). Gosh were things simple back in the 60s. If you want to represent a character digitally, you would simply map it to a number between 1 and 127. Voila, all set. Time to drive home in your Chevrolet, and listen to a Bob Dylan, Beach Boys or Beatles record. I won’t go in to the details now, but for the sake of completeness I will include the ASCII representation of the word “Bob Dylan”:

String:      B    o    b         D    y    l    a    n
Decimal:     66   111  98   32   68   121  108  111  110
Hexadecimal: 0×42 0x6F 0×62 0×20 0×44 0×79 0x6C 0x6F 0x6E
Binary:      01000010 01101111 01100010 00010100
             01000100 01111001 01101100 01101111 01101110

Computers, plain and simple as they are, store everything as numbers of course, and as such we need a way to map numbers to letters, and vice versa. This is of course the purpose of the ASCII table, which tells our computers to display a “B” instead of 66.

Since the 7-bit ASCII table has a maximum of 127 characters, any ASCII character can be represented using 7 bits (though they usually consume 8 bits now). This makes calculating, how long a string is for example, quite easy. In C programs for example, ASCII characters are represented using chars, which use 1 byte (=8 bits) of storage. Here is an example in C:

char author[] = “The Beatles”;
int authorLen = strlen(author);        // authorLen = 11
size_t authorSize = sizeof(author);    // authorSize = 12

The only reason the two variables are different, is because C automatically appends a 0×0 character at the end of a string (to indicate where it terminates), and as such the size will always one char(acter) longer than the length.

So, this is all fine and well if we only deal with “simple” languages like English. Once we try to represent a more complex language, Japanese for example, things start to get more challenging. The biggest problem is the sheer number of characters – there are simply more than 127 characters in the world’s written languages. ASCII was extended to 8-bit (primarily to accommodate European languages), but this still only scratches the surface when you consider Asian and Arabic languages.

Hence, a big problem with ASCII is that is essentially a fixed-length, 8-bit encoding, which makes it impossible to represent complex languages. This is where the Unicode standard comes in: It gives each character a unique code point (number), and includes variable-length encodings as well as 2-byte (or more) encodings.

But before we go to deep into Unicode, we’ll just blatantly pretend that Unicode doesn’t exist and think of a different way to store Japanese text. Yes! Let us enter a world where every language uses a different encoding! No matter what they want to make you believe – having countless encodings around is fun and exciting. Well, actually it’s not, but let’s take a look here why.

The ASCII characters end at 127, leaving another 127 characters for other languages. Even though I’m not a linguist, I know that there are more than 127 characters in the rest of the world. Additionally, many Asian languages have significantly more characters than 255 characters, making a multi-byte encoding (since you cannot represent every character with one byte) necessary.

This is where encodings come in (or better, “came” in before Unicode was established), which are basically like stencils. Let’s use Japanese for our code page example. I don’t speak Japanese unfortunately, but let’s take a look at this word, which means “Farewell” in Japanese (you are probably familiar with pronunciation – “sayōnara”):


The ASCII table obviously has no representation for these characters, so we would need a new table. As it turns out, there are two main encodings for Japanese: Shift-JIS and EUC-JP. Yes, as if it’s not bad enough to have one encoding per language!

So code pages serve the same purpose as the ASCII table, they map numbers to letters. The problem with code pages – opposed to Unicode – is that both the author and the reader need to view the text in the same code page. Otherwise, the text will just be garbled. This is what “sayōunara” looks like in the aforementioned encodings:


0xA4 B5 A4 E8 A4 A6 A4 CA A4 E9

0×82 B3 82 E6 82 A4 82 C8 82 E7

Their numerical representation between EUC-JP and Shift_JIS is, as is to be expected, completely different – so knowing the encoding is vital. If the encodings don’t match, then the text will be meaningless. And meaningless text is useless.

You can imagine that things can get out of hand when one party (party can be an Operating System, Email client, etc.) uses EUC-JP, and the other Shift_JIS for example. They both represent Japanese characters, but in a completely different way.

Encodings can either (to a certain degree) be auto-detected, or specified as some sort of meta information. Below is a HTML page with the same Japanese word, Shift_JIS encoded:

    <TITLE>Shift_JIS Encoded Page</TITLE>

    <META HTTP-EQUIV=”Content-Type” CONTENT=”text/html; charset=Shift_JIS”>

You can paste this into an editor, save it has a .html file, and then view it in your favorite browser. Try changing “Shift_JIS” to “EUC-JP”, fun things await you.

But I am getting carried away, after all this post is about Unicode, not encodings. So, Unicode solves these problems by giving every character from every language a unique code point. No more “Shift_JIS”, no more “EUC-JP” (not even to mention all the other encodings out there), just UNICODE.

Once a document is encoded in Unicode, specifying a code page is no longer necessary – as long as the client (reader) supports the particular Unicode encoding (e.g. UTF-8) the text is encoded with.

The five major Unicode encodings are:

UTF-16 (an extension of UCS-2)

All of these encodings are Unicode, and represent Unicode characters. That is, UTF-8 is just as capable as UTF-16 or UTF-32. The number in the encoding name represents the minimum number of bits that are required to store a single Unicode code point. As such, UTF-32 can potentially require 4 x as much storage as UTF-8 – depending on the text that is being encoded. I will be ignoring UTF-7 going forward, as its use is not recommended and it’s not widely used anymore.

The biggest difference between UTF-8 and UCS-2/UTF-16/UTF-32 is that UTF-8 is a variable length encoding, opposed to the others being fixed-length encodings. OK, that was a lie. UCS-2, the predecessor of UTF-16, is indeed a fixed length encoding, whereas UTF-16 is a variable length encoding. In most use cases however, UTF-16 uses 2 bytes and is essentially a fixed length encoding. UTF-32 on the other hand, and that is not a lie, is a fixed-length encoding that always uses 4 bytes to store a character.

Let’s look at this table which lists the 4 major encodings and some of their properties:

Encoding   Variable/Fixed   Min Bytes   Max Bytes
UTF-8 variable 1 4
UCS-2 fixed 2 2
UTF-16 variable 2 4
UTF-32 fixed 4 4

What this means, is that in order to represent a Unicode character (e.g. さ), a variable length encoding might require more than 1 byte, and in UTF-8′s case up to 4 bytes. UTF-8 needs potentially more bytes, since it maintains backward-compatibility with ASCII, and as such loses 7 bits.

Windows uses UTF-16 to store strings internally, as do most Unicode frameworks such as ICU and Qt‘s QString. Most Unixes on the other hand use UTF-8, and it’s also the most commonly found encoding on the web. Mac OSX is a bit of a different beast; due to it using a BSD kernel, all BSD system functions use UTF-8, whereas Apple’s Cocoa framework uses UTF-16.

UCS-2 or UTF-16
I had already mentioned that UTF-16 is an extension of UCS-2, so how does it extend it and why does it extend it?

You see, Unicode is so comprehensive now that it encompasses more than what you can store in 2 bytes. All characters (code points) from 0×0000 to 0xFFFF are in the “BMP“, the “Basic Multilingual Plane”. This is the plane that uses most of the character assignments, but additional planes exist, and here is a list of all planes:

•    The “BMP”, “Basic Multilingual Plane”, 0×0000 -> 0xFFFF
•    The “SMP”, “Supplementary Multilingual Plane”, 0×10000 -> 0x1FFFF
•    The “SIP”, “Supplementary Ideographic Plane”, 0×20000 -> 0x2FFFF
•    The “SSP”, “Supplementary Special-purpose Plane”, 0xE0000 -> 0xEFFFF

So technically, having 2 bytes available is not even enough anymore to cover all the available code points, you can only cover the BMP. And this is the main difference between UCS-2 and UTF-16, UCS-2 only supports code points in the BMP, whereas UTF-16 supports code points in the supplementary planes as well, through something called “surrogate pairs“.

Representation in Unicode
So let’s look at the above sample text in Unicode, shall we? Sayonara Shift_JIS & EUC-JP! The site has some great online tools for Unicode, one of which is called “Uniview“. It shows us the actual Unicode code points, the symbol itself and the official description:

eventlogblog_unicode_uniview.pngThe official Unicode notation (U+hex) for the above characters uses the U+ syntax, so for the above letters we would write:

U+3055 U+3088 U+3046 U+306A U+3089

With this information, we can now apply one of the UTF encodings to see the difference:


E3 81 95 E3 82 88 E3 81 86 E3 81 AA E3 82 89

30 55 30 88 30 46 30 6A 30 89

00 00 30 55 00 00 30 88 00 00 30 46 00 00 30 6A 00 00 30 89

So UTF-8 uses 5 more bytes than UCS-2/UTF-16 to represent the same exact characters. Remember that UCS-2 and UTF-16 would be identical for this text since all characters are in the BMP. UTF-32 uses yet 5 more bytes then UTF-8 and would be require the most storage space, as to be expected.

What you can also see here, is that UTF-16 essentially mirrors the U+ notation.

Fixed Length or Variable Length?
Both encoding types have their advantages and disadvantages, and I will be comparing the most popular UTF encodings, UTF-8 and UCS-2, here:

Variable Length UTF-8:
•    ASCII-compatible
•    Uses potentially less space, especially when storing ASCII
•    String analysis/manipulation (e.g. length calculation) is more CPU-intensive

Fixed Length UCS-2:
•    Potentially wastes space, since it always uses fixed amount of storage
•    String analysis/manipulation is usually less CPU intensive

Which encoding to use will depend on the application. If you are creating a web site, then you should probably choose UTF-8. If you are storing data in a database however, then it will depend on the type of strings that will be stored. For example, if you are only storing languages that cannot be represented through ASCII, then it is probably better to use UCS-2. If you are storing both ASCII and languages that require Unicode, then UTF-8 is probably a better choice. An extreme example would be storing English-Only text in a UCS-2 database – it would essentially use twice as much storage as an ASCII version, without any tangible benefits.

One of the strongest suits of UTF-8, at least in my opinion, is its backward compatibility with ASCII. UTF-8 doesn’t use any numbers below 127 (0x7F), which are – well – reserved for ASCII characters. This means that all ASCII text is automatically UTF-8 compatible, since any UTF-8 parser will automatically recognized those characters as being ASCII and render them appropriately.

And this brings us to the next topic – the BOM (header). BOM stands for “Byte Order Mark”, and is usually a 2-4 byte long header in the beginning of a Unicode text stream, e.g. a text file. If a text editor does not recognize a BOM header, then it will usually display the BOM header as either the þÿ or ÿþ characters.

The purpose of the BOM header is to describe the Unicode encoding, including the endianess, of the document. Note that a BOM is usually not used for UTF-8.

Let’s revisit the example from earlier, the UTF-16 encoding looked like this:

30 55 30 88 30 46 30 6A 30 89

If we wanted to store this text in a file, including a BOM header, then it could look also look like this:

FF FE 55 30 88 30 46 30 6A 30 89 30

“FF FE” is the BOM header, and in this case indicates that a UTF-16 Little Endian encoding is used. The same text in UTF-16 Big Endian would look like this:

FE FF 30 55 30 88 30 46 30 6A 30 89

The BOM header is generally only useful when Unicode encoded documents are being exchanged between systems that use different Unicode encodings, but given the extremely little overhead it certainly doesn’t hurt to add it to any UTF-16 encoded document. As such, Windows always adds a 2-byte BOM header to all Unicode text documents. It is the responsibility of the text reader (e.g. an editor) to interpret the BOM header correctly. Linux on the other hand, being a UTF-8 fan and all, does not need to (and does not) use a BOM header – at least not by default.

Tools & Resources
There are a variety of resources and tools available to help with Unicode authoring, conversions, and so forth.

I personally like Ultraedit, which lets me convert documents to and from UTF-8 and UTF-16, and also supports the BOM headers. GEdit on Linux is also very capable, and supports different code pages (if you ever need to use those) as well. Babelpad is an editor designed specifically for Unicode, and seems to support every possible encoding. I have not actually used this editor though.

A nifty online converter that I already mentioned earlier can be found at, and also check out UniView:

The official Unicode website is of course a great resource too, though potentially overwhelming to mere mortals that only have to deal with Unicode occasionally. The best place to start is probably their basic FAQ:

I hope this provides some clarification for those who know that Unicode exists, but are not entirely comfortable with the details.


Running Linux applications on Windows – over the network with Xming

I always find it interesting to
see clothes and accessories that were in fashion 30 years ago, make it back
into the mainstream. It seems like the computer industry also goes in cycles
every now and then.

Back in the early days of
computing – before the dawn of the glorious PC era – there were few powerful
servers that were accessed by dumb terminals. The emergence of the IBM PC
changed all that and eventually led to the rich clients that most of us have
under our desks today. The traditional PC desktop however causes quite a bit of
management overhead – especially in large organizations – which appears to be
leading to the re-emergence of “dumb” terminals that access a powerful – well –
terminal server. Only this time we have a fancy user interface.


If you have worked with Unix-like
operating systems before, then you’re probably familiar with the X windows
, though most people don’t know about the X Windows system’s (from now on referenced to as X11) network
. In essence, you can run
an application on host A, but
actually display and interact with the application on host B. Furthermore, you can actually utilize X11 to remotely log into a
host running X11 without the need to install additional software on that host –
provided that X11 is configured to support this. The screenshot below shows this a bit better.So what does this mean in
practice? You can install a resource-hungry application on a dedicated and
powerful Linux host, yet run and execute the application on a different, less
powerful Linux machine – even if that machine is not even running Linux. What’s
even better is that those remote applications appear just like any other
application on your desktop. Citrix calls this “application publishing”, and
Microsoft introduced “TS RemoteApp” with the Windows Server 2008 platform. Yet,
X Windows has offered this functionality for decades – from the very start.

But what makes this feature
really interesting for us windows admins (or Unix admins that, for whatever reason, have to use a Windows workstation), is the fact that you can install an X
server on your windows machine and run Linux applications “natively” on it
– thanks to the open-source project Xming).

Xming, according to the project
web site, is the “leading free unlimited
X Window Server for Microsoft Windows® (XP/2003/Vista)
”. There have been
security concerns in the past when using X11 remotely, but by tunneling X11
traffic through SSH, Xming is actually quite secure and doesn’t usually require
any configuration changes on the host running X11 (phew!).

When tasked with either cross-platform
system administration or development, the discovery of Xming opens up a door of
possibilities. For example, you can edit remote configuration files
conveniently by running your favorite Linux editor on your Windows desktop, or
run a terminal like gnome-terminal. Why run a terminal through X-Windows when
you can just use an SSH app like PuTTY? For one thing, you can launch GUI
applications directly from the terminal (e.g. ‘gedit &’) on your Windows
desktop. Of course, you can also play a Linux game on Windows that way.

If you’re a cross-platform
developer, then you can execute a Linux/Unix development studio (e.g. eclipse)
on your Windows box – and it appears just like any other Windows app. And since
it’s technically running on the Linux box, compiling on your Windows app really
compiles it on the remote platform (e.g. Linux). The responsiveness of applications is also quite good, at least over an Ethernet connection.

This technique also works for
multiple end users, so it’s also possible to connect to one Linux machine from
multiple Windows machines and run Linux apps. The Linux machine really acts
like a terminal server in this case.

Let’s look at how to run a Linux
app on a Windows desktop. I used Ubuntu 8.10 and installed Xming on a Vista laptop. So, download & install the following Xming
packages from

  • Xming
  • Xming-fonts

Then, start XLaunch from the
start menu and select the following options:

  1. Multiple Windows
  2. Start a program
  3. Start program: Enter the application you want to
    launch there. E.g. gnome-terminal,
    gedit, mahjongg
    or whichever remote application you want to run
  4. Run remote – using PuTTY: Select this option and
    specify the computer name, user name and password.
  5. On the next step, simply leave the default options in
    place, click “Next” and “Finish”.


You should now have a little X
icon on the tray, and the application you selected should be running on your
desktop. The screenshot below shows gnome-terminal and gnome-text-editor
running on my Vista machine.


Xming uses plink.exe (see also:
internally to execute apps, whose display is then redirected to our local Windows
client, on the remote host. You can also save these settings in a configuration file and create a shortcut on your desktop or start menu.

If the XDMCP protocol is enabled
on the Linux/Unix host (disabled by default on most distributions for security
reasons), then you can log into the remote host for a complete remote session
similar to VNC or other remote desktop applications. But again, keep in mind
that XDMCP transmits data in clear text over the wire (using both TCP and UDP),
and as such is an insecure protocol that should only be enabled in trusted
networks. To log in remotely with Xming, select the following options after
starting XLaunch:

  • One Window
  • Open session via XDMCP
  • Specify the remote host name

xming_xlaunch_xdmcp.pngOne last tip regarding Xming: If, at some point down the line, you are unable to launch remote apps on your desktop, even though the X tray icon from Xming is present, then try to reset the X server by right-clicking the tray icon and choosing “Exit”.

Well, I hope this gives you a
starting point and helps ease the pain when maintaining heterogeneous network

Until next time,