How can you get more characters than there are Bytes?

lostsoul62

Member
I have an Excel 2010 file with 6 tabs. One tab has 195,000 words (it actually has more, because I didn't count the 6,000+ dates) and the other 5 tabs have about 50,000 words, for a total of, let's say, 250,000 words. That comes out to about 1,250,000 characters or bytes (5 characters per word is the norm), yet the file only takes up 822K of space. Would someone explain how that is possible?
 

Okedokey

Well-Known Member
One of the first encoding schemes developed for use in mainstream computers was the ASCII (American Standard Code for Information Interchange) standard. It was developed in the 1960s in the United States.

The English alphabet uses part of the Latin alphabet (for instance, there are few accented words in English). There are 26 individual letters in that alphabet, not considering case. Any scheme that aims to encode English also has to include the digits and the punctuation marks.

The 1960s were also a time when computers had far less memory and disk space than we have now. ASCII was developed as a standard representation of a functional alphabet across all American computers. ASCII itself is a 7-bit code; storing each character in 8 bits (1 byte) came from technical details of the time (the Wikipedia article mentions that perforated tape held 8 bits in a position at a time), and the eighth bit could be used for parity checks. Later developments extended the original ASCII scheme with several accented, mathematical and terminal characters.
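To make the "7 bits plus a parity bit" idea concrete, here is a minimal sketch in Python. The function name `with_even_parity` and the choice of even parity are my own assumptions for illustration; real equipment of the era varied.

```python
def with_even_parity(ch: str) -> int:
    """Pack a 7-bit ASCII code into 8 bits, using the top bit for even parity.

    Hypothetical helper for illustration only.
    """
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    # Parity bit is 1 when the 7-bit code has an odd number of 1-bits,
    # so the full 8-bit value always has an even number of 1-bits.
    parity = bin(code).count("1") % 2
    return (parity << 7) | code


for ch in "AC":
    print(ch, format(ord(ch), "07b"), format(with_even_parity(ch), "08b"))
```

The low seven bits are the plain ASCII code; only the eighth bit changes, which is why the character data itself never needed more than 7 bits.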

As computer usage spread across the world, more and more speakers of different languages gained access to a computer. That meant that new encoding schemes had to be developed for each language, independently of one another, and text written under one scheme would be misread on a terminal set up for a different one.
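A small Python sketch of that conflict: the same single byte decodes to different characters under two of those per-language schemes. The two code pages used here (Latin-1 and Windows-1251) are just examples I picked, not ones mentioned above.

```python
raw = bytes([0xE9])  # one byte, value 233

# The same byte read by a Western European and by a Cyrillic code page.
as_western = raw.decode("latin-1")   # ISO 8859-1
as_cyrillic = raw.decode("cp1251")   # Windows-1251

print(repr(as_western), repr(as_cyrillic))
```

Nothing in the byte itself says which reading is the right one, which is the problem Unicode set out to remove.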

Unicode came as a solution to the existence of different terminals, by merging all possible meaningful characters into a single abstract character set.

UTF-8 is one way to encode the Unicode character set. It is a variable-width encoding (i.e. different characters can have different sizes) and it was designed for backwards compatibility with the older ASCII scheme. As such, ASCII characters still take one byte each, whilst all other characters take two or more bytes. UTF-16 is another way to encode the Unicode character set. In contrast to UTF-8, each character is encoded as either one or two 16-bit code units.
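If you want to see the variable widths for yourself, something like the following Python sketch should do. The sample characters and the helper name `encoded_sizes` are my own choices; `utf-16-le` is used so that no byte order mark is added to the count.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# An ASCII letter, an accented letter, the euro sign and an emoji.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

So "one character" can mean anything from 1 to 4 bytes on disk, depending on the character and on the encoding.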

The 'a' character occupies a single byte while 'ա' occupies two bytes, which is what you would expect from a UTF-8 encoding. The extra byte in your question was due to a newline character at the end.
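A quick way to check those numbers in Python. This is only a sketch: `utf8_len` is a made-up helper, and I am assuming the newline is a single line feed rather than a Windows-style CR+LF pair.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Latin 'a', Armenian 'ա' (U+0561), and the same letter followed by a newline.
for text in ["a", "\u0561", "\u0561\n"]:
    print(repr(text), "characters:", len(text), "bytes:", utf8_len(text))
```

The character count and the byte count only match for pure ASCII text.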
 

lostsoul62

Member
Thank you for the history lesson, which I learned in the first year of college. However, you haven't answered my question. Your last sentence doesn't explain anything. Would you explain how "w" occupies two bytes, assuming you're using the ASCII code?
 

mihir

VIP Member
Thank you for the history lesson, which I learned in the first year of college. However, you haven't answered my question. Your last sentence doesn't explain anything. Would you explain how "w" occupies two bytes, assuming you're using the ASCII code?

He/She has more than answered your question. I would suggest reading up on character encoding more. It's a very interesting topic (I ask about it a lot during interviews); even databases use it for some minor compression.

I am not too sure about Excel, but LibreOffice lets you choose the encoding when saving a spreadsheet. So I would suggest checking the file encoding.

And the wiki page on UTF-8 is pretty good, if you don't want to take a generic approach towards understanding variable-width encoding - http://en.wikipedia.org/wiki/UTF-8

@Okedokey
:good: for the reply :)
 

spirit

Moderator
Staff member
One of the first encoding schemes developed for use in mainstream computers was the ASCII (American Standard Code for Information Interchange) standard. It was developed in the 1960s in the United States.

<etc>

Thanks for this post. I just so happen to have a test on some of this stuff after the Christmas holiday, so this is good revision material for me too! ;)
 