Thursday, July 7, 2011

Strings Can Suck It slight return.

I'm phoning it in for this one. I wrote this a while ago at work for the youngins. It is still relevant months later!!!


I hate strings. I hate few things, hate is a strong hurtful word. I hate strings. –Sounds like a song lyric.

I digress. Strings are horrible, horrible things. They are human readable, but we don’t write code for humans to read, we write it for computers. They can convey a great deal of information, but they can also confuse the hell out of things in the many contexts afforded by the English language (26 different meanings for the word “set”, really!). “Strings in games” is an oxymoron. The two do not belong together. Ever. Even if it’s a text-based adventure game, in which case you have a time machine and live in 1983, when text games were cool.

Why the hate Playa? What’s so bad about them sweet strings? I’ll tell you. As I said I hate them:

  1. They take up a lot of runtime memory. 1 byte per character doesn’t seem like a lot, but when you get into the thousands of sentences it eats up the memories.
  2. They fragment and pollute memory. They are variable length. This makes the cache gods weep.
  3. Programmers abuse them. The number of times I have seen strings passed by value makes my eyes bleed.
  4. You have to write all sorts of stupid conversion tricks. Remember your old job interview programming tests? AtoI or change case? Blah blah blah
  5. They are like needles poking holes in your memory balloon. Pointer arithmetic, failed bounds checks, stomps! Corruption! Almost always because of a string.
  6. They are slow. We get clever and pack a bunch of “stuff” into them, e.g. “give_finger_hand_right”, which needs to be split up into its parts when a simple event id would do.
  7. Everyone has their “own” awesome string class. It is not awesome. It is a string class. Ever see that MythBusters episode where they polished animal poops into shiny spheres? String classes are like that but without the moustache dude.
  8. We abuse them for weird type conversion. String streams are not a good solution for the absence of RTTI. That’s sooo 2002. If you really need reflection (you don’t), templates will work fine provided you don’t go crazy with them (this will be next week’s rant).
  9. I didn’t even talk about Unicode. Unicode is like Satan’s string steroids. Unicode can internationally fuck off.
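
To make point 6 concrete, here is a minimal C++ sketch of the two approaches. The enum names, the `Event` struct and both functions are made up for illustration; they are not from any real engine. The string version has to allocate and split every time the event is handled, while the id version is a handful of integer compares.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event parts; names are illustrative only.
enum class Action : std::uint8_t { GiveFinger };
enum class Limb   : std::uint8_t { Hand };
enum class Side   : std::uint8_t { Left, Right };

// Fixed-size event: a few bytes, trivially copyable, nothing to parse.
struct Event {
    Action action;
    Limb   limb;
    Side   side;
};

// The string way: split "give_finger_hand_right" on '_' each time it is handled.
// Every part is a fresh std::string, so this may allocate repeatedly.
std::vector<std::string> split_event(const std::string& text) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = text.find('_', start);
        if (end == std::string::npos) {
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// The id way: two integer comparisons, no allocation, no parsing.
bool is_right_hand(const Event& e) {
    return e.limb == Limb::Hand && e.side == Side::Right;
}
```

Note that `split_event` at least takes its argument by const reference, which also covers the complaint in point 3.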

What’s it all mean Basil?

I hate them. Make them go away.

How?

Hash man, it’s all about hash. Now hashing isn’t a magic wand, it has not ended revolutionary wars to my knowledge but hash values are like beautiful flowers growing in the dirt that is strings.

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum. (from the Wiki land).

Here’s the basic idea:

“There once was a man from Nantucket who kept all of his strings in a bucket” -> MagicHashUnicornFunction() -> 4

All of that limerick converted to the number 4! Beautiful! How does that work? I don’t care, there are a million different hashing algorithms. You could use 64-bit or 32-bit hashes, but the general idea is that all of your huge, polluting, variable-sized, cache-smashing, memory-stomping, not cool, weird-converting strings get condensed into a nice fixed-length number that will fit nicely in a cache line and win you a frosty malt beverage from me at the pub.
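
For the curious, one of those million algorithms is 32-bit FNV-1a. The sketch below is just one reasonable pick, not the function used for anything in this post, and the name `fnv1a32` is my own. A real hash will of course give you a 32-bit number rather than 4, but the principle is the same.

```cpp
#include <cstdint>
#include <string_view>

// 32-bit FNV-1a. It is constexpr, so string literals can be hashed at
// compile time and never need to exist in the shipped binary as text.
constexpr std::uint32_t fnv1a32(std::string_view text) {
    std::uint32_t hash = 2166136261u;          // FNV offset basis
    for (char c : text) {
        hash ^= static_cast<std::uint8_t>(c);  // mix in one byte
        hash *= 16777619u;                     // FNV prime (wraps mod 2^32)
    }
    return hash;
}

// Known FNV-1a test vectors, checked by the compiler.
static_assert(fnv1a32("") == 0x811c9dc5u, "empty input yields the offset basis");
static_assert(fnv1a32("a") == 0xe40c292cu, "single-byte test vector");
```

Two different strings can land on the same hash, so a real pipeline would want to check for collisions when the hashes are generated.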

We still need to have the string somewhere if it has to be displayed on the screen. That unfortunately is unavoidable, but it can sit on disc, where our nice hash can ultimately index it and send it off to mister GPU, and not hang about in our heaps or stack making the engine groan. I’m going to write a small test app that will use hashes as an index to an array of “proc-snippets”. My new made-up word. A proc-snippet will be like a teletype ribbon (you’re old like me if you know what those are; I used to send love notes to one of the receptionists at Texas Instruments over a teletype when I was 5). We will use the proc-snippet to procedurally generate a sentence, just out of curiosity as to how fast or slow it would be vs. disc access of said sentence.
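
Here is a rough sketch of the "hash indexes the string" idea, with everything in it assumed for illustration: the `StringTable` class, the `Entry` layout, and the FNV-1a hash are my own choices, and an in-memory blob stands in for the data that would really live on disc. It is not the test app mentioned above, only the lookup half of it.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Assumed hash function (32-bit FNV-1a); any decent hash would do.
constexpr std::uint32_t fnv1a32(std::string_view text) {
    std::uint32_t hash = 2166136261u;
    for (char c : text) {
        hash ^= static_cast<std::uint8_t>(c);
        hash *= 16777619u;
    }
    return hash;
}

// One fixed-size index entry: where a display string lives in the blob.
struct Entry {
    std::uint32_t hash;
    std::uint32_t offset;
    std::uint32_t length;
};

class StringTable {
public:
    // Build step; in practice this would run offline in the asset pipeline.
    // Re-sorting on every add is wasteful but keeps the sketch short.
    // Hash collisions are not handled here.
    void add(std::string_view text) {
        index_.push_back({fnv1a32(text),
                          static_cast<std::uint32_t>(blob_.size()),
                          static_cast<std::uint32_t>(text.size())});
        blob_.append(text);
        std::sort(index_.begin(), index_.end(),
                  [](const Entry& a, const Entry& b) { return a.hash < b.hash; });
    }

    // Runtime lookup: binary search over fixed-size entries, no string compares.
    // The returned view is only valid until the next add().
    std::string_view find(std::uint32_t hash) const {
        auto it = std::lower_bound(
            index_.begin(), index_.end(), hash,
            [](const Entry& e, std::uint32_t h) { return e.hash < h; });
        if (it == index_.end() || it->hash != hash) return {};
        return std::string_view(blob_.data() + it->offset, it->length);
    }

private:
    std::vector<Entry> index_;  // small, fixed-size, cache friendly
    std::string        blob_;   // stand-in for the string data on disc
};
```

Game code would then carry around only the 32-bit value, e.g. `table.find(fnv1a32("Game Over"))`, and the text itself is touched only at the moment it has to be drawn.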

Should you disagree with my hatred of the string please enlighten me with a book reference of your choice. I will hash the entire book and reply with the number 4.

2e5905a6

1 comment:

  1. In a commercial project, one that was supposed to be a performance-critical application for ISPs, I ran a profiler (callgrind) and the winner for calls/performance was .... strcmp(), and the silver plate went to .... strcpy().

    Yes, strings suck. I avoid them at all costs.
