j teaches you: binary

(and a bunch of related concepts!)

this lesson is a work in progress, last updated 2015-02-23.

this lesson assumes you know nothing about computers.

the basics

decimal, the number system you’re most familiar with, is base 10: digits range from 0 to 9, which means each digit has ten possible states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. decimal digits are named from the Latin digitus, meaning “a finger or toe”; most humans have 10 of each.

binary, the number system computers are most familiar with, is base 2: digits range from 0 to 1, which means each digit has two possible states: 0 and 1, sometimes represented as off and on, or false and true. binary digits are called bits, short for binary digits. their unit symbol is bit; you may sometimes see b used, which is discouraged because it can be confused with B.

individually, bits cannot store very much information (each has only two states, ⅕ as many as a decimal digit), but in combination they can store much more. a group of bits large enough to store a single character, such as j or @, is a byte. their unit symbol is B; you may sometimes see b used, which is very discouraged because it will likely be confused with bit.

older computers did not talk to each other—network—very often, so how they handled data varied much more between models (called architectures). because of this, the number of bits needed to store—encode—a character varied by architecture. these days, a byte is nearly always eight bits.

(this history is why technical writing often calls bytes octets, from the Latin oct-, meaning eight. their unit symbol is o. while octet is more specific, in common usage byte and octet are synonymous.)

representing numbers

in decimal, 245 is 245:

hundreds  tens  ones
10²       10¹   10⁰
=         =     =
100       10    1
×         ×     ×
2         4     5
=         =     =
200       40    5

200 + 40 + 5 = 245, right? now, in binary, 245 is 1111 0101:

2⁷   2⁶  2⁵  2⁴  2³  2²  2¹  2⁰
=    =   =   =   =   =   =   =
128  64  32  16  8   4   2   1
×    ×   ×   ×   ×   ×   ×   ×
1    1   1   1   0   1   0   1
=    =   =   =   =   =   =   =
128  64  32  16  0   4   0   1

128 + 64 + 32 + 16 + 0 + 4 + 0 + 1 = 245. exactly the same method as before, but with a base of two instead of ten. 1111 0101₂ = 245₁₀!
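the same sum can be sketched in Python. this is only an illustration; the function name bits_to_int is made up for this lesson, and Python's built-in int(text, 2) does the same job.

```python
def bits_to_int(bits: str) -> int:
    """add up each bit multiplied by its power of two, ignoring spaces."""
    digits = bits.replace(" ", "")
    total = 0
    # the rightmost bit is position 0 (worth 2**0 = 1)
    for position, bit in enumerate(reversed(digits)):
        total += int(bit) * 2 ** position
    return total


print(bits_to_int("1111 0101"))  # 245
print(int("11110101", 2))        # 245, using the built-in conversion
```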

(when a number is written smaller and below the baseline, as a subscript, it’s telling you the base being used. you could potentially see a genetic codon written UGC₄, or an English word as zoology₂₆.)

note how each additional bit doubles the number of possible values: seven bits hold 0 to 127 for 128 possible values, while the eight bits of a byte hold 0 to 255 for a total of 256 possible values.
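a quick way to check the doubling, sketched in Python (value_count is a hypothetical helper name, not anything standard):

```python
def value_count(bits: int) -> int:
    """number of distinct values that fit in the given number of bits."""
    return 2 ** bits


for bits in (1, 4, 7, 8):
    count = value_count(bits)
    print(f"{bits} bits: {count} values, 0 to {count - 1}")
```

each extra bit multiplies the count by two, so value_count(8) is exactly twice value_count(7).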

(representing negative numbers in binary is significantly more complicated, and is beyond the scope of this lesson.)

counting

a table’s worth a thousand words:

2³  2²  2¹  2⁰  X₁₀  X₁₆
0   0   0   0   0    0
0   0   0   1   1    1
0   0   1   0   2    2
0   0   1   1   3    3
0   1   0   0   4    4
0   1   0   1   5    5
0   1   1   0   6    6
0   1   1   1   7    7
1   0   0   0   8    8
1   0   0   1   9    9
1   0   1   0   10   a
1   0   1   1   11   b
1   1   0   0   12   c
1   1   0   1   13   d
1   1   1   0   14   e
1   1   1   1   15   f

take some time to notice the patterns; it may help to internalise them by counting in binary, using your literal digits as bits:

  1. hold your right hand as a fist, palm up, for 0000
  2. extend your thumb for 0001
  3. retract your thumb and extend your index finger for 0010
  4. extend your thumb again for 0011
  5. retract your thumb and index finger and extend your middle finger for 0100
  6. extend your thumb for 0101
  7. retract your thumb and extend your index finger for 0110
  8. extend your thumb for 0111
  9. retract all three digits and extend your ring finger for 1000
  10. extend your thumb for 1001
  11. retract your thumb and extend your index finger for 1010
  12. extend your thumb again for 1011
  13. retract your thumb and index finger and extend your middle finger for 1100
  14. extend your thumb for 1101
  15. retract your thumb and extend your index finger for 1110
  16. extend your thumb for 1111

notice how the frequency you flip each digit on and off matches that of the table: the “ones” bit, your thumb, toggles every time. the “twos” bit, your index finger, alternates between off for two and on for two. the “fours” bit, your middle finger, is off four then on four.
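the toggling pattern can also be printed out. this is a small Python sketch, with bit_column as a made-up helper name; it reads one column of the table from top to bottom.

```python
def bit_column(k: int, count: int = 16) -> str:
    """bit k (0 = the "ones" bit) of every number from 0 to count - 1."""
    return "".join(str((n >> k) & 1) for n in range(count))


print(bit_column(0))  # 0101010101010101  (thumb: toggles every time)
print(bit_column(1))  # 0011001100110011  (index finger: off two, on two)
print(bit_column(2))  # 0000111100001111  (middle finger: off four, on four)
print(bit_column(3))  # 0000000011111111  (ring finger: off eight, on eight)
```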

(you can count to 31 on a single hand by using your pinky as a “sixteens” bit; it’s quite handy! some people use both hands to count significantly higher.)

hexadecimal

the table shows all possible states for four bits—half of a byte—commonly called a nibble, sometimes spelt nybble. (yes, nerds are nerds.) in more formal contexts, nibbles are called quartets, but that’s decidedly less fun.

as seen in the last column of the table, all 16 possible nibbles can be represented by a single digit in base 16, called hexadecimal. hexadecimal digits range from 0 to f, which means each has sixteen possible states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f.

the digits a to f can be written in either uppercase or lowercase; which is used often falls to personal preference. note that uppercase A can resemble 4, B can resemble 8, and D can resemble 0, whilst lowercase b can resemble 6, especially in handwriting.

(similarly, if you have to transmit hexadecimal via speech you may want to use the NATO phonetic alphabet.)

because a hexadecimal digit is half a byte, pairs of them are frequently used as shorthand: it’s easier to read, write, and remember f5 than 1111 0101.

(a common programming notation is prefixing binary numbers with 0b, decimal numbers with nothing, and hexadecimal numbers with 0x; for example: 0b1111 0101, 245, and 0xf5. now that we’ve established the basics of numbers, this is the style the rest of this lesson will use to help acclimate you to what you’ll most commonly see ‘in the wild’.)
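Python happens to use these same prefixes, so the three spellings can be compared directly. one assumption to note: Python does not allow a space inside a number, so the sketch uses an underscore as the separator instead.

```python
number = 0b1111_0101  # binary literal; the underscore is only a visual separator

print(number == 245 == 0xf5)   # True: three spellings of one number
print(bin(number))             # 0b11110101
print(hex(number))             # 0xf5
print(format(number, "08b"))   # 11110101 (eight bits, no prefix)
print(format(number, "02x"))   # f5 (two hexadecimal digits, no prefix)
```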

encoding text

now you know how to read binary numbers, so let’s move on to text. as said before, the way old computers encoded text varied much more between architectures. in the early 1960s the US-ASCII encoding was standardised, and it persists to this day.

to save space—both in storage and transmission, which were far more expensive then—ASCII is 7-bit, giving 128 code points:

hex       0    1    2    3    4    5    6    7    8    9    a    b    c    d    e    f
bin       0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
0 000     NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1 001     DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2 010     SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3 011     0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4 100     @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5 101     P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6 110     `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7 111     p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
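to look something up in the table, the row supplies the first hexadecimal digit and the column supplies the second. a minimal Python sketch of that lookup, where code_point is a hypothetical helper name:

```python
def code_point(row: int, column: int) -> int:
    """combine a row digit (0x0 to 0x7) with a column digit (0x0 to 0xf)."""
    return (row << 4) | column


print(hex(code_point(0x6, 0xa)), chr(code_point(0x6, 0xa)))  # 0x6a j
print(hex(code_point(0x4, 0x0)), chr(code_point(0x4, 0x0)))  # 0x40 @
```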

the first two rows—and, for clever reasons, the very last code point—are control characters: they control the flow of data, and most are rarely used these days.

the first 10 characters of the 0x3 row are the Arabic numerals we know and love. their encoding is very straightforward: 2, for example, is simply 0x32, and 6 is 0x36.

(storing 26 as the two digit characters 0x32 0x36, rather than as the number 0x1a, is a form of binary-coded decimal: the low nibble of each byte holds one decimal digit.)
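the difference between the text "26" and the number 26 is easy to see in Python. in this sketch digit_value is a made-up helper; it relies only on the digits sitting at 0x30 to 0x39.

```python
as_text = "26".encode("ascii")  # the characters '2' and '6'
print(as_text.hex())            # 3236
print(hex(26))                  # 0x1a


def digit_value(char: str) -> int:
    """numeric value of an ASCII digit character: '0' is 0x30, '9' is 0x39."""
    return ord(char) - 0x30


print(digit_value("2"), digit_value("6"))  # 2 6
```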

the 0x4 and 0x5 rows are mostly the uppercase letters of the Latin alphabet, and the 0x6 and 0x7 rows are mostly their lowercase equivalents: the only difference between J (0b100 1010) and j (0b110 1010) is the sixth bit; this is useful for case insensitive comparisons of text.
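a minimal Python sketch of using that bit, with CASE_BIT, flip_case, and same_letter as names invented for illustration. it assumes its inputs are single ASCII letters; for other characters the trick breaks down (@ and ` differ by the same bit but are not a case pair), so same_letter checks for letters first.

```python
CASE_BIT = 0b010_0000  # the sixth bit, worth 32 (0x20)


def flip_case(letter: str) -> str:
    """toggle the case of a single ASCII letter by flipping the sixth bit."""
    return chr(ord(letter) ^ CASE_BIT)


def same_letter(a: str, b: str) -> bool:
    """case-insensitive comparison of two single ASCII letters."""
    both_letters = a.isascii() and a.isalpha() and b.isascii() and b.isalpha()
    # setting the sixth bit on both sides forces both to lowercase
    return both_letters and (ord(a) | CASE_BIT) == (ord(b) | CASE_BIT)


print(flip_case("J"), flip_case("j"))  # j J
print(same_letter("J", "j"))           # True
print(same_letter("J", "k"))           # False
```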

all other code points are symbols, the most notable being 0x20, the blank space.

(if you’ve ever noticed %20s in URLs, those are indeed spaces: characters that aren’t allowed in URLs are represented with the percent sign followed by their code point in hexadecimal. for example, a comma is encoded as %2C, and a question mark as %3F. this is called percent-encoding or, colloquially, URL encoding.)
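Python's standard library can do percent-encoding for you. a short sketch follows: quote and unquote are real functions in urllib.parse, while percent_encode is a made-up helper that shows the rule by hand for a single ASCII character.

```python
from urllib.parse import quote, unquote

# safe="" asks quote to encode every reserved character, including "/"
print(quote("a b,c?", safe=""))  # a%20b%2Cc%3F
print(unquote("a%20b%2Cc%3F"))   # a b,c?


def percent_encode(char: str) -> str:
    """a percent sign followed by the character's code point as two hex digits."""
    return f"%{ord(char):02X}"


print(percent_encode(" "), percent_encode(","), percent_encode("?"))  # %20 %2C %3F
```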

control characters and whitespace

the most commonly used control characters are, not coincidentally, the ones pertinent to text:

hex   abbr  name
0x09  HT    horizontal tab
0x0a  LF    line feed
0x0d  CR    carriage return

tab is the indentation character you’re likely already familiar with; it’s sometimes abbreviated \t. the two line breaks, however, are a little trickier. historically, it was chaotic, but thankfully these days newlines are generally represented one of two ways:

1. solely with 0x0a, the line feed, sometimes abbreviated \n or LF. you may hear this called Unix-style, as it’s used by all Unix operating systems and their descendants, including Linux and OS X.

2. a carriage return followed by a line feed, 0x0d 0x0a, sometimes abbreviated \r\n or CR+LF. you may hear this called Windows-style, as it’s used by Microsoft Windows, as well as most text-based Internet protocols.

(i remember them as \nice and \really \nice; if you’re feeling snarky, you could call Windows-style \redundant \newlines.)

if you’ve ever opened a .txt file in Notepad and found all the text run together on one line, differences in newlines were to blame. this lesson was written with Unix-style newlines, but uploaded to my server via FTP, the File Transfer Protocol, and served to you via HTTP, the HyperText Transfer Protocol, both of which use Windows-style newlines. interoperability!
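converting between the two styles is a matter of swapping byte sequences. here is a minimal Python sketch; to_unix and to_windows are hypothetical helper names, and the sketch ignores lone carriage returns, which real tools may also have to handle.

```python
def to_unix(data: bytes) -> bytes:
    """replace every CR+LF pair (0x0d 0x0a) with a lone LF (0x0a)."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data: bytes) -> bytes:
    """replace every newline with CR+LF, without doubling existing CRs."""
    return to_unix(data).replace(b"\n", b"\r\n")


print(to_unix(b"one\r\ntwo\r\n"))   # b'one\ntwo\n'
print(to_windows(b"one\ntwo\n"))    # b'one\r\ntwo\r\n'
```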

the other two control characters that affect text are the rarely used vertical tab and form feed:

hex   abbr  name
0x0b  VT    vertical tab
0x0c  FF    form feed

these, along with the three mentioned above and the space (0x20) are called whitespace characters: not graphical symbols, but characters that affect the spacing of the text. whitespace is an important concept in computing.
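Python's string module keeps a list of exactly these six characters in string.whitespace, which makes for a quick check. this is only a sketch: other languages and tools may count more characters as whitespace.

```python
import string

codes = sorted(ord(char) for char in string.whitespace)
print([hex(code) for code in codes])
# ['0x9', '0xa', '0xb', '0xc', '0xd', '0x20']
```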

control characters: miscellany

of note are the first and last characters of ASCII:

hex   abbr  name
0x00  NUL   null
0x7f  DEL   delete

in the days of punched tape, null was a lack of holes where a byte could be written later; delete would punch out all holes, obliterating whatever byte had been there.

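because the control characters aren’t graphical symbols (they’re non-printing characters), they can be written in caret notation: the caret (^) followed by the ASCII character whose code point has the seventh bit flipped, for example:

hex   abbr  caret
0x00  NUL   ^@
0x09  HT    ^I
0x0a  LF    ^J
0x0d  CR    ^M
0x1b  ESC   ^[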

DEL might seem like a special case, but it follows the same rule: DEL (0b111 1111) with the seventh bit flipped is ? (0b011 1111), so it’s written ^?.
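the rule is short enough to sketch in Python. caret is a made-up function name, and the sketch only accepts the ASCII control characters (0x00 to 0x1f, plus 0x7f).

```python
SEVENTH_BIT = 0b100_0000  # worth 64 (0x40)


def caret(code: int) -> str:
    """caret notation for an ASCII control character's code point."""
    if not (0x00 <= code <= 0x1f or code == 0x7f):
        raise ValueError("not an ASCII control character")
    return "^" + chr(code ^ SEVENTH_BIT)


print(caret(0x00))  # ^@  (NUL)
print(caret(0x09))  # ^I  (HT)
print(caret(0x1b))  # ^[  (ESC)
print(caret(0x7f))  # ^?  (DEL)
```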

reading text

let’s get to it:

59 65 61 68 2c 20 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e

at a glance we see a lot of characters that start with 0x6 and 0x7, so this is mostly lowercase text. they’re separated by a few 0x20s—spaces—so there are discrete words. it begins with a capital letter (0x5), and ends with a period (0x2e). just by looking it over we can deduce it’s probably a sentence.
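the same glance can be done by counting how many bytes fall in each row of the ASCII table, which is just each byte's first hexadecimal digit. a small Python sketch (the variable names are made up):

```python
from collections import Counter

data = bytes.fromhex(
    "59 65 61 68 2c 20 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e"
)

# shifting right by four bits keeps only the first hexadecimal digit
rows = Counter(byte >> 4 for byte in data)

for row in sorted(rows):
    print(hex(row), rows[row])
# 0x2 5   (symbols, including the three spaces)
# 0x5 1   (the one capital letter)
# 0x6 9   (lowercase)
# 0x7 5   (lowercase)
```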

starting at ‘a’ you can count out each character on your hand; i find it helpful to remember some milestones:

(*many would easily answer that the ninth letter of the alphabet is ‘i’, but would pause if asked the ‘a-th’ letter.)

let’s break it down: the characters up to the first space are 59 65 61 68 2c 20.

1. 0x59 is our only capital; we can either count up nine from ‘P’ (0x50), or back once from ‘Z’ (0x5a), giving us ‘Y’.

2. 0x65 is the fifth letter of the alphabet, four up from ‘a’: ‘e’.

3. 0x61 literally is ‘a’, so that was a freebie.

4. 0x68 is the eighth letter of the alphabet, seven up from ‘a’: ‘h’.

5. you may recall that the 0x2 row is all symbols; 0x2c is the comma.

6. 0x20, a space.
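the counting in the steps above can be checked with a little Python. rather than counting on fingers, this sketch masks off the low five bits, which for ASCII letters gives the same position in the alphabet; alphabet_position is a hypothetical helper name.

```python
def alphabet_position(code: int) -> int:
    """position in the alphabet (a = 1) of an ASCII letter's code point."""
    return code & 0b1_1111  # keep only the low five bits


for code in (0x59, 0x65, 0x61, 0x68):
    print(hex(code), alphabet_position(code), chr(code))
# 0x59 25 Y
# 0x65 5 e
# 0x61 1 a
# 0x68 8 h
```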

so far we have “Yeah, ”; continue on with 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e.

the answer

did you get it? it says:

596561682c206974207265616c6c79207761732e
Yeah, it really was.
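once you can do it by hand, you can let a computer check your work. a minimal Python sketch: bytes.fromhex reads pairs of hexadecimal digits (spaces are allowed), and decode("ascii") looks each byte up in the ASCII table.

```python
encoded = "59 65 61 68 2c 20 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e"

decoded = bytes.fromhex(encoded).decode("ascii")
print(decoded)  # Yeah, it really was.

# and back again, as one unbroken run of hexadecimal digits
print(decoded.encode("ascii").hex())
# 596561682c206974207265616c6c79207761732e
```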
