j teaches you: binary

(and a bunch of related concepts!)

this lesson is a work in progress, last updated 2015-02-23.

this lesson assumes you know nothing about computers.

the basics

decimal, the number system you’re most familiar with, is base 10: digits range from 0 to 9, which means each digit has ten possible states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. decimal digits are named from the Latin digitus, meaning “a finger or toe”; most humans have 10 of each.

binary, the number system computers are most familiar with, is base 2: digits range from 0 to 1, which means each digit has two possible states: 0 and 1, sometimes represented as off and on, or false and true. binary digits are called bits, short for binary digits. their unit symbol is bit; you may sometimes see b used, which is discouraged because it can be confused with B.

individually, bits cannot store very much information (each has only two states to a decimal digit’s ten), but in combination they can store much more. a group of bits large enough to store a single character, such as j or @, is a byte. the unit symbol for bytes is B; you may sometimes see b used, which is very discouraged because it will likely be confused with bit.

older computers did not talk to each other—network—very often, so how they handled data varied much more between models (called architectures). because of this, the number of bits needed to store—encode—a character varied by architecture. these days, a byte is nearly always eight bits.

(this history is why technical writing often calls bytes octets, from the Latin oct-, meaning eight. their unit symbol is o. while octet is more specific, in common usage byte and octet are synonymous.)

representing numbers

in decimal, 245 is 245:

10²	10¹	10⁰
hundreds	tens	ones
=	=	=
100	10	1
×	×	×
2	4	5
=	=	=
200	40	5

200 + 40 + 5 = 245, right? now, in binary, 245 is 1111 0101:

2⁷	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰
=	=	=	=	=	=	=	=
128	64	32	16	8	4	2	1
×	×	×	×	×	×	×	×
1	1	1	1	0	1	0	1
=	=	=	=	=	=	=	=
128	64	32	16	0	4	0	1

128 + 64 + 32 + 16 + 0 + 4 + 0 + 1 = 245. exactly the same method as before, but with a base of two instead of ten. 1111 0101₂ = 245₁₀!
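the place-value method in the two tables above can be sketched in a few lines of python. this is only an illustration: the function name from_digits is made up for this example, and it assumes the digits are given left to right as whole numbers.

```python
def from_digits(digits, base):
    """Add up each digit multiplied by its place value (base to some power)."""
    total = 0
    # the rightmost digit has place value base**0, the next base**1, and so on
    for position, digit in enumerate(reversed(digits)):
        total += digit * base ** position
    return total

decimal_245 = from_digits([2, 4, 5], 10)               # 200 + 40 + 5
binary_245 = from_digits([1, 1, 1, 1, 0, 1, 0, 1], 2)  # 128 + 64 + 32 + 16 + 0 + 4 + 0 + 1
print(decimal_245, binary_245)  # 245 245
```

the same function works for any base, as long as every digit is smaller than the base.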

(when a number is written smaller and below the baseline—subscript—it’s telling you the base being used. you could potentially see a genetic codon written UGC₄, or an English word as zoology₂₆.)

note how each additional bit doubles the number of values that can be represented: seven bits hold 0 to 127 for 128 possible values, while a byte holds 0 to 255 for a total of 256 possible values.
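to see the doubling for yourself, here is a small python sketch. the helper name value_count is hypothetical, chosen just for this example.

```python
def value_count(bits):
    """Number of distinct values a group of `bits` bits can hold."""
    return 2 ** bits

for bits in (1, 4, 7, 8):
    count = value_count(bits)
    # values start at zero, so the largest one is always count - 1
    print(f"{bits} bits: 0 to {count - 1}, {count} possible values")
```

the last two lines it prints should match the seven-bit and eight-bit figures given above.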

(representing negative numbers in binary is significantly more complicated, and is beyond the scope of this lesson.)

counting

a table’s worth a thousand words:

2³	2²	2¹	2⁰	X₁₀	X₁₆
0	0	0	0	0	0
0	0	0	1	1	1
0	0	1	0	2	2
0	0	1	1	3	3
0	1	0	0	4	4
0	1	0	1	5	5
0	1	1	0	6	6
0	1	1	1	7	7
1	0	0	0	8	8
1	0	0	1	9	9
1	0	1	0	10	a
1	0	1	1	11	b
1	1	0	0	12	c
1	1	0	1	13	d
1	1	1	0	14	e
1	1	1	1	15	f

take some time to notice the patterns; it may help you internalise them to count in binary using your literal digits as bits:

hold your right hand as a fist, palm up, for 0000
extend your thumb for 0001
retract your thumb and extend your index finger for 0010
extend your thumb again for 0011
retract your thumb and index finger and extend your middle finger for 0100
extend your thumb for 0101
retract your thumb and extend your index finger for 0110
extend your thumb for 0111
retract all three digits and extend your ring finger for 1000
extend your thumb for 1001
retract your thumb and extend your index finger for 1010
extend your thumb again for 1011
retract your thumb and index finger and extend your middle finger for 1100
extend your thumb for 1101
retract your thumb and extend your index finger for 1110
extend your thumb for 1111

notice how the frequency you flip each digit on and off matches that of the table: the “ones” bit, your thumb, toggles every time. the “twos” bit, your index finger, alternates between off for two and on for two. the “fours” bit, your middle finger, is off four then on four.
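if you’d rather let a computer do the finger exercise, this python sketch prints how each bit behaves while counting from 0 to 15. the function name bit_column is made up for this example; it assumes four-bit numbers, like the table.

```python
def bit_column(place, width=4):
    """Values of one bit (place 0 is the "ones" bit) while counting up."""
    rows = [format(n, f"0{width}b") for n in range(2 ** width)]
    # in the written form, place 0 is the rightmost character
    return "".join(row[width - 1 - place] for row in rows)

for place in range(4):
    print(f"the {2 ** place}s bit: {bit_column(place)}")
```

each line it prints is one column of the table read top to bottom: the “ones” bit alternates every step, the “twos” bit every two steps, and so on.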

(you can count to 31 on a single hand by using your pinky as a “sixteens” bit; it’s quite ~~handy~~ useful! some people use both hands to count significantly higher.)

hexadecimal

the table shows all possible states for four bits—half of a byte—commonly called a nibble, sometimes spelt nybble. (yes, nerds are nerds.) in more formal contexts, nibbles are called quartets, but that’s decidedly less fun.

as seen in the last column of the table, all 16 possible nibbles can be represented by a single digit in base 16, called hexadecimal. hexadecimal digits range from 0 to f, which means each has sixteen possible states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f.

the digits a to f can be written in either uppercase or lowercase; which is used often falls to personal preference. note that uppercase A can resemble 4, B can resemble 8, and D can resemble 0, whilst lowercase b can resemble 6, especially in handwriting.

(similarly, if you have to transmit hexadecimal via speech you may want to use the NATO phonetic alphabet.)

because a hexadecimal digit is half a byte, pairs of them are frequently used as shorthand: it’s easier to read, write, and remember f5 than 1111 0101.
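a rough python sketch of that shorthand: split the byte into its two nibbles and turn each into one hexadecimal digit. the function name byte_to_hex is hypothetical, and it assumes the input is exactly eight binary digits, optionally separated by spaces.

```python
def byte_to_hex(bits):
    """Turn eight binary digits into two hexadecimal digits, nibble by nibble."""
    bits = bits.replace(" ", "")
    if len(bits) != 8:
        raise ValueError("expected exactly eight binary digits")
    high, low = bits[:4], bits[4:]
    # int(text, 2) reads binary; format(number, "x") writes lowercase hexadecimal
    return format(int(high, 2), "x") + format(int(low, 2), "x")

print(byte_to_hex("1111 0101"))  # f5
```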

(a common programming notation is prefixing binary numbers with 0b, decimal numbers with nothing, and hexadecimal numbers with 0x; for example: 0b1111 0101, 245, and 0xf5. now that we’ve established the basics of numbers, this is the style the rest of this lesson will use to help acclimate you to what you’ll most commonly see ‘in the wild’.)
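python happens to be one of the languages that uses these prefixes, so the three spellings can be compared directly. one difference to be aware of: python does not allow a space inside a number, so an underscore stands in as the visual separator here.

```python
binary = 0b1111_0101   # underscore instead of the space used in this lesson
decimal = 245
hexadecimal = 0xf5

print(binary == decimal == hexadecimal)  # True
print(bin(245), hex(245))                # 0b11110101 0xf5
```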

encoding text

now you know how to read binary numbers, so let’s move on to text. as said before, the way old computers encoded text varied much more between architectures. in the early 1960s the US-ASCII encoding was standardised, and it persists to this day.

to save space—both in storage and transmission, which were far more expensive then—ASCII is 7-bit, giving 128 code points:

hex		0	1	2	3	4	5	6	7	8	9	a	b	c	d	e	f
	bin	0000	0001	0010	0011	0100	0101	0110	0111	1000	1001	1010	1011	1100	1101	1110	1111
0	000	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
1	001	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US

2	010	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
3	011	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?

4	100	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
5	101	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_

6	110	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
7	111	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~	DEL

the first two rows—and, for clever reasons, the very last code point—are control characters: they control the flow of data, and most are rarely used these days.

the first 10 characters of the 0x3 row are the Arabic numerals we know and love. their encoding is very straightforward: 2, for example, is simply 0x32, and 6 is 0x36.
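a quick python check of that claim. ord is python’s built-in for looking up a character’s code point; the helper digit_value is made up for this example and assumes a single ASCII digit.

```python
def digit_value(character):
    """Number that an ASCII digit stands for: its code point minus 0x30."""
    code = ord(character)
    if not 0x30 <= code <= 0x39:
        raise ValueError("not an ASCII digit")
    return code - 0x30

print(hex(ord("2")), hex(ord("6")))      # 0x32 0x36
print(digit_value("2"), digit_value("6"))  # 2 6
```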

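(storing 26 as the two characters 0x32 0x36 instead of the single number 0x1a keeps one decimal digit per byte, with the digit itself in the low nibble. this is closely related to binary-coded decimal, which in its packed form would store 26 as 0x26.)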

the 0x4 and 0x5 rows are mostly the uppercase letters of the Latin alphabet, and the 0x6 and 0x7 rows are mostly their lowercase equivalents: the only difference between J (0b100 1010) and j (0b110 1010) is the sixth bit; this is useful for case insensitive comparisons of text.
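here is a small python sketch of how that one bit can be put to use. the names CASE_BIT, flip_case, and same_letter are invented for this example, and the operators ^ (flip the bits that are set on the right-hand side) and | (turn them on) are python’s, not something this lesson has covered. it only makes sense for the 52 ASCII letters; for symbols such as @ and ` the trick gives misleading results.

```python
CASE_BIT = 0b010_0000  # the sixth bit from the right, worth 32 (0x20)

def flip_case(letter):
    """Swap the case of one ASCII letter by flipping its sixth bit."""
    return chr(ord(letter) ^ CASE_BIT)

def same_letter(a, b):
    """Compare two ASCII letters with the sixth bit forced on in both."""
    return (ord(a) | CASE_BIT) == (ord(b) | CASE_BIT)

print(flip_case("J"), flip_case("j"))  # j J
print(same_letter("J", "j"))           # True
```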

all other code points are symbols, the most notable being 0x20, the blank space.

(if you’ve ever noticed %20s in URLs, those are indeed spaces: characters that aren’t allowed in URLs are represented with the percent sign followed by their code point. for example, a comma is encoded as %2C, and a question mark as %3F. this is called percent-encoding or, colloquially, URL encoding.)
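percent-encoding a single ASCII character can be sketched by hand in python, and the standard library’s urllib.parse module offers quote and unquote for whole strings. the helper name percent_encode is hypothetical; it assumes one ASCII character and skips the question of which characters actually need encoding.

```python
from urllib.parse import quote, unquote

def percent_encode(character):
    """Percent sign followed by the character's code point as two hex digits."""
    return "%" + format(ord(character), "02X")

print(percent_encode(" "), percent_encode(","), percent_encode("?"))  # %20 %2C %3F

# the library versions work on whole strings
print(quote("yeah, it really was?"))
print(unquote("it%20really%20was"))
```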

control characters and whitespace

the most commonly used control characters are, not coincidentally, the ones pertinent to text:

HT, horizontal tab, 0x09
LF, line feed, 0x0a
CR, carriage return, 0x0d

tab is the indentation character you’re likely already familiar with; it’s sometimes abbreviated \t. the two line breaks, however, are a little trickier. historically, newline handling was chaotic, but thankfully these days newlines are generally represented one of two ways:

1. solely with 0x0a, the line feed, sometimes abbreviated \n or LF. you may hear this called Unix-style, as it’s used by all Unix operating systems and their descendants, including Linux and OS X.

2. a carriage return followed by a line feed, 0x0d 0x0a, sometimes abbreviated \r\n or CR+LF. you may hear this called Windows-style, as it’s used by Microsoft Windows, as well as most text-based Internet protocols.

(i remember them as \nice and \really \nice; if you’re feeling snarky, you could call Windows-style \redundant \newlines.)

if you’ve ever opened a .txt file in Notepad and found all the text run together on one line, a difference in newlines was to blame. this lesson was written with Unix-style newlines, but uploaded to my server via FTP, the File Transfer Protocol, and served to you via HTTP, the HyperText Transfer Protocol, both of which use Windows-style newlines in their own commands and headers. interoperability!
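converting between the two newline styles comes down to swapping bytes. a minimal python sketch follows; the function names to_windows and to_unix are made up for this example, and it assumes the data contains no lone carriage returns.

```python
def to_unix(data):
    """Replace every CR+LF pair (0x0d 0x0a) with a lone LF (0x0a)."""
    return data.replace(b"\r\n", b"\n")

def to_windows(data):
    """Replace every newline with CR+LF, without doubling existing ones."""
    return to_unix(data).replace(b"\n", b"\r\n")

unix_text = b"first\nsecond\n"
windows_text = to_windows(unix_text)
print(unix_text.hex(" "))     # each line ends in 0a
print(windows_text.hex(" "))  # each line ends in 0d 0a
```

normalising to Unix-style first is a design choice here: it means running to_windows on text that is already Windows-style leaves it unchanged.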

the other two control characters that affect text are the rarely used:

VT, vertical tab, 0x0b
FF, form feed, 0x0c

these, along with the three mentioned above and the space (0x20), are called whitespace characters: not graphical symbols, but characters that affect the spacing of the text. whitespace is an important concept in computing.
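python’s standard library keeps this same list of six characters in string.whitespace, which makes for an easy cross-check. the dictionary name WHITESPACE below is just for this example.

```python
import string

# code point -> abbreviation, as listed in this section
WHITESPACE = {
    0x09: "HT", 0x0a: "LF", 0x0b: "VT",
    0x0c: "FF", 0x0d: "CR", 0x20: "SP",
}

for code, name in WHITESPACE.items():
    print(f"0x{code:02x} {name}")

# the same six characters that python treats as ASCII whitespace
print(set(string.whitespace) == {chr(code) for code in WHITESPACE})  # True

# splitting on whitespace is one place the concept shows up in practice
print("yeah,\tit really\nwas.".split())  # ['yeah,', 'it', 'really', 'was.']
```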

control characters: miscellany

of note are the first and last characters of ASCII:

NUL, null, 0x00 (0b000 0000)
DEL, delete, 0x7f (0b111 1111)

in the days of punched tape, null was a lack of holes where a byte could be written later; delete would punch out all holes, obliterating whatever byte had been there.

because the control characters aren’t graphical symbols (they’re non-printing characters), they can be written in caret notation: the caret (^) followed by the character whose ASCII code is the control character’s code with the seventh bit flipped:

NUL (0b000 0000) = ^@ (0b100 0000)
BS (0b000 1000) = ^H (0b100 1000)
LF (0b000 1010) = ^J (0b100 1010)
CR (0b000 1101) = ^M (0b100 1101)
ESC (0b001 1011) = ^[ (0b101 1011)

DEL might seem like a special case, but it follows the same rule, its code with the seventh bit flipped: DEL (0b111 1111) = ^? (0b011 1111).
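the caret rule is short enough to sketch in python. the function name caret is hypothetical, and ^ inside the code is python’s operator for flipping bits (it only coincidentally looks like the caret being printed).

```python
SEVENTH_BIT = 0b100_0000  # worth 64 (0x40)

def caret(code):
    """Caret notation for an ASCII control character's code point."""
    if not (0x00 <= code <= 0x1f or code == 0x7f):
        raise ValueError("not an ASCII control character")
    # flipping the seventh bit lands on a printable character
    return "^" + chr(code ^ SEVENTH_BIT)

for name, code in [("NUL", 0x00), ("BS", 0x08), ("LF", 0x0a),
                   ("CR", 0x0d), ("ESC", 0x1b), ("DEL", 0x7f)]:
    print(name, caret(code))
```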

reading text

let’s get to it:

59 65 61 68 2c 20 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e

at a glance we see a lot of characters that start with 0x6 and 0x7, so this is mostly lowercase text. they’re separated by a few 0x20s—spaces—so there are discrete words. it begins with a capital letter (0x5), and ends with a period (0x2e). just by looking it over we can deduce it’s probably a sentence.

starting at ‘a’ you can count out each character on your hand; i find it helpful to remember some milestones:

0x61 = ‘a’, the beginning
0x6a = ‘j’, a corollary to ‘z’, and the first letter on its row beyond the Arabic numerals to which you’re accustomed*
0x6f = ‘o’, the last of the 0x6 row
0x70 = ‘p’, the first of the 0x7 row
0x7a = ‘z’, the end, and the only letter on its row beyond the Arabic numerals

(*many would easily answer that the ninth letter of the alphabet is ‘i’, but would pause if asked the ‘a-th’ letter.)

let’s break it down: the characters up to the first space are 59 65 61 68 2c 20.

1. 0x59 is our only capital; we can either count up nine from ‘P’ (0x50), or back once from ‘Z’ (0x5a), giving us ‘Y’.

2. 0x65 is the fifth letter of the alphabet, four up from ‘a’: ‘e’.

3. 0x61 literally is ‘a’, so that was a freebie.

4. 0x68 is the eighth letter of the alphabet, seven up from ‘a’: ‘h’.

5. you may recall that the 0x2 row is all symbols; 0x2c is the comma.

6. 0x20, a space.

so far we have “Yeah, ”; continue on with 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e.

the answer

did you get it? it says:

59	65	61	68	2c	20	69	74	20	72	65	61	6c	6c	79	20	77	61	73	2e
Y	e	a	h	,		i	t		r	e	a	l	l	y		w	a	s	.
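once you can do it by hand, it’s fair to let a computer check your work. this python sketch decodes the same bytes two ways: one pair at a time, mirroring the manual method, and with the built-in bytes.fromhex. the variable names are just for this example.

```python
message = "59 65 61 68 2c 20 69 74 20 72 65 61 6c 6c 79 20 77 61 73 2e"

# by hand: read each pair as a base-16 number, then look up its character
decoded = "".join(chr(int(pair, 16)) for pair in message.split())
print(decoded)  # Yeah, it really was.

# the shortcut: python parses the hex and decodes it as ASCII
shortcut = bytes.fromhex(message).decode("ascii")
print(shortcut == decoded)  # True
```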

to do:

general cleanup
explain bitmasks
prob don't bother with LSB / endianness; give brief mention perhaps
explain KB vs KiB
have cutoff after ASCII where other encodings are explained for the curious (ISO 8859-1, Windows-1252, UTF-8)
teach how to read the binary structure of UTF-8
