0

Alright, so we need to store a list of words and their respective position in a much bigger text. We've been asked if it's more efficient to save the position represented as text or represented as bits (data streams in Java).

I think that a bitwise representation is best since the text "1024" takes up 4*8=32 bits while only 11 if represented as bits.

The follow up question is should the index be saved in one or two files. Here I thought "perhaps you can't combine text and bitwise-representation in one file?" and that's the reason you'd need two files?

So the question first and foremost is can I store text-information (the word) combined with bitwise-information (it's position) in one file?

2 Answers 2

0

Too vague in terms of whats really needed.

If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to sent the data over a low bandwidth network.

Then there is general data compression available, by just wrapping your Input/OutputStreams with deflater or gzip (already built in the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression there is XZ for java (implements LZMA compression), open source.

If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.

-1

The number 1024 would at least take 2-4 bytes (so 16-32 bits), as you need to know where the number ends and where it starts, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per numbers (which would be better than 9 bytes as a string representation).

Using binary files you'll need of a way to know where the string starts and end, too. You can do this storing a byte before it, with its length, and reading the string like this:

byte[] arr = new byte[in.readByte()]; // in.readByte()*2 if the string is encoded in 16 bits
in.read(arr); // in is a FileInputStream / RandomAccessFile
String yourString = new String(arr, "US-ASCII");

The other possiblity would be terminating your string with a null character (00), but you would need to create your own implementation for that, as no readers support it by default (AFAIK).

Now, is it really worth storing it as binary data? That really depends on how big your positions are (because the strings, if in the text version are separated from their position with a space, would take the same amount of bytes). My recommendation is that you use the text version, as it will probably be easier to parse and more readable.

About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though making it in two separated files will always take a bit more space, and it might make it more messy to edit).

4
  • How do you calculate the numbers? I was thinking that in a string representation each number is one char = one byte? Whereas if I store it as bits then i.e. 1024 = 10000000000 = 11 bits? While in text 1024 = 4*8 = 32?
    – MrJalapeno
    Commented Sep 10, 2015 at 13:06
  • 11 bits, but as I've said you need of a way to know where the number starts and where it ends. That is done giving it a fixed size, usually 2 or 4 bytes. The string representation is one byte per char, yes.
    – eric.m
    Commented Sep 10, 2015 at 13:10
  • Ignoring the return value of InputStream.read is asking for trouble.
    – VGR
    Commented Sep 10, 2015 at 14:31
  • Sorry, I didn't see that. I've edited my answer now.
    – eric.m
    Commented Sep 10, 2015 at 14:38

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.