Bayes' Theorem

May 12, 2012 3:38 pm

By request, here's my attempt to explain Bayes' theorem (cribbing heavily from Wikipedia).

Deriving Bayes' Theorem

We start with the standard definition of conditional probability (for events A and B):

3c5bf4874edc582346cb78a5eb66aaeb

Which reads: The probability of event A given the known occurrence of event B is equal to the joint probability of events A and B (e.g., the probability of both events occurring) divided by the probability of event B (assuming the probability of B is not zero).  I'm not going to show the derivation of conditional probability.

The summation axiom helps us understand the joint probability of A and B:

9e168c9112e06ac47b11fef388957692

It tells us that the joint probability of A and B is equal to the probability of A given B multiplied by the probability of B.  All we're talking about is the likelihood of events A and B both occurring.

Keep the following equivalence in mind, we'll need it in a minute.  It simply says that the order of variables when writing the joint probability is irrelevant.  It should be fairly straightforward that the likelihood of events A and B both occurring is the same as the likelihood of events B and A both occurring.

joint_equivalance

Filling back in our definition of conditional probability, we have (with the understanding that P(B) is not 0):

bayes_derivation

The third equation is the simplest form of Bayes' theorem.  It wasn't very hard to get to and the math, relatively speaking, is quite simple (we're not talking about deriving the Schrödinger equation or anything absurd).  But its application, and understanding what it means, can be tricky.

P(A) is called the "prior," representing our prior belief in the occurrence of event A.

P(A|B) is called the "posterior," representing our belief in the occurrence of event A after (post) accounting for event B.

The remaining pieces, P(B|A) / P(B), are called the "support," representing the support event B provides for the likelihood of event A occurring.

Applying Bayes' Theorem

I'll re-use the medical testing scenario from the previous post.

Substituting T to represent a positive test result and D to indicate the presence of the disease our equation becomes:

test_disease

We expand P(T) into P(T|D)P(D) + P(T|¬D)P(¬D) which is to say: The probability of getting a positive test, P(T), is equal to [the probability of getting a positive test when the disease is present, P(T|D), multiplied by the probability that the disease is present, P(D)] plus [the probability of getting a positive test when the disease is not present, P(T|¬D), multiplied by the probability that the disease is not present, P(¬D)].

So now we just need to map the numbers we have about the disease and the test to the variables in our equation:

P(D), our prior, is 1 in 200 million--the disease occurrence rate in the general population.

P(T|D), the probability of getting a positive test when the disease is present, is derived from the false negative rate.  When the disease is present, our test will only incorrectly say it is not 1 out of 200,000 times; so P(T|D) is 199,999 out of 200,000.

P(T|¬D), the probability of getting a positive test when the disease is not present is the false-positive rate: 1 in 100,000.

P(¬D), the probability of the disease not being present is simply the other 199,999,999 out of 200 million.

So let's plug these numbers in step by step:

test_disease_computation

And we see that P(D|T), the probability that the disease is present given a positive test, is only 4.76%.  (My previous post incorrectly reported this as 0.05%, I've corrected the error.)

The car example I presented didn't use hard numbers, but I'll frame the concept into Bayes' Theorem.  S will represent a car being stolen and H will represent a car being a Honda Civic:

stolen_honda

Which says, the probability that a car is stolen given that it's a Honda Civic is equal to [the probability that a car is a Honda Civic given that it's stolen] multiplied by [the probability of a car being stolen] divided by [the probability of a car being a Honda Civic].

The dealership was trying to push an insurance policy on me by only reporting P(H|S), but unless we account for the number of Honda Civics on the road in the first place, P(H), we aren't getting the full story.

(Notice that we don't have to actually care about P(S) if all we're doing is comparing car make/models.  P(S) doesn't change for the different make/models so it doesn't affect the relative rankings.)

PaperTrust

October 5, 2011 7:10 pm

So, a while back I blogged an idea I had about cryptographically signing various documents.  I specifically talked about checks, but you can apply the principle anytime you have a fairly small amount of data which is supposed to be issued from a trusted source: cashier's checks, money orders, driver's licenses, event tickets, passports, boarding passes, etc.

Well, I spent some time playing around and put together a working example.  It's not fancy, but it does the job.  It's been a few months, but it really didn't take that long.  Especially since I had to do some reading about QR codes and using them, along with public-key cryptography, from Python.  So I had a basic prototype done in about a week.  Then back in August I decided to flesh things out a bit more and produce a nice demo application.  I'm calling the system "PaperTrust" as it allows you to embed the trust element onto the paper item.

Here's a video demonstration:

Text description of the demo:
So, in my demo, we generate data for a cashier's check and then sign it using the demo private key.  We stick the signed data (which includes a signing-organization ID) and the signature into a QR code and stick that onto the check and print it.  Now the check is physical and can be carried around as usual.

Now say you're going to use this check to pay for something from a stranger.  This stranger needs to know they can trust the check.  So they use their verifier application to scan the QR code from your check.  It reads the organization ID, looks up the correct public key for that organization, and verifies that the signature is valid.  It also displays the signed data so the person can compare it to what's physically printed on the check.  This is a cryptographically secure guarantee that the check is valid (or at worst an exact copy of a real check, which should make tracking down counterfeiters a lot easier).  So you would use this in tandem with traditional anti-forgery measures like watermarks, micro-print, thermal ink, etc.

I've put the code up on GitHub: PaperTrust on GitHub.

A Rumination on Science and Education

September 7, 2011 8:21 pm

I'm currently reading a biography of the physicist Richard Feynman (by James Gleick).  So far it's excellent.  What I'm really fascinated with right now is (at least how Gleick portrays) the progression of science during Feynman's schooling years (the mid to late 1930s).  The number of high caliber physicists at the time (and the time just leading up to it) is astounding: Einstein, Bohr, Rutherford, Heisenberg, Dirac, Lorentz, Schrödinger, De Broglie, Fermi, Oppenheimer, and I'm probably missing some still.  Those guys are each incredible scientists in their own right and it's no wonder the understanding of physics changed so dramatically during the 1930s.  The only comparison I can think of is the progression of art during the European Renaissance.

As I'm reading, I can't help but wonder about what set apart that time period in history from anything since in terms of scientific progression.  Computer Science has a similar vein of tumultuous rapid progression during the era of Turing, von Neumann, Dijkstra, Gödel, Church, Cook, Levin, Kleene, Shannon...But as I'm looking at it, most of these pioneers (in fact, all but Dijsktra) were essentially contemporaries of the physics revolution being discussed.  They all would have been products of the same time period of schooling (whether in the U.S. or Europe).  Which further raises the question of what was so different about the education systems through which these incredible people went?

Sadly, I don't really have an answer.  But if we're looking to reform our education system for better results, what better goal than to figure out what was happening in education from about 1910-1935?

But then, maybe it wasn't the education system at all.  Maybe it was the societal mindset about learning and discovery.  Maybe it was simply that the education system and society didn't inhibit the intense drive for understanding and innovation that these people felt.  Quoting from page 63 of the book (Genius: The Life and Science of Richard Feynman):

At MIT in the thirties the nerd did not exist; a penholder worn in the shirt pocket represented no particular gaucherie; a boy could not become a figure of fun merely by studying....America's future scientists and engineers, many of them rising from the working class, valued studiousness without question.

If this is an accurate portrayal of the time period, it certainly helps explain to me why so many incredible scientists were produced during that era.  Gleick describes in one passage of how Feynman and many of his contemporaries grew up reading the Encyclopedia Britannica eager to learn more about the world around them.

They tinkered with, broke, and repaired things--something I think is rarely encouraged these days.  I know this is one of the ways I developed my own interests in science and computers.  I wanted to learn how things worked, so I played with them, changed them, broke them, and attempted to repair them (sometimes successfully).

People are inquisitive by nature.  I think we, as a society, are getting far too good at crushing that inquisitiveness with standardized lesson plans which allow no room for deviation to follow student interests, standardized pedagogy which insists all students learn in the same way, and standardized tests which demand that all students regurgitate their "knowledge" in one, simplified fashion.

If there's one thing I learned in the years I worked as a T.A. it's that students assimilate information in incredibly varied ways.  Its hard to come up with new approaches to the material on-the-fly in order to try to help the student make the connection.  But if you don't, and instead insist on "the one true approach" to the material, the student will fall behind, become discouraged, and lose interest in the subject matter.

We need to encourage the asking of questions and the seeking out of answers by research, experimentation, or otherwise.  We need to foster the innate curiosity, creativity, and inquisitiveness that children have.

I'm not so concerned with the mindless consumption of media or playing of games because our minds need downtime to process and assimilate the world around us.  However, I think the hours spent watching TV and browsing the Internet are more of a symptom than a cause; in that we still seek out "new" things, just in a manner that parents aren't worried about anyone getting hurt or anything getting broken.  But situations where one might get hurt or something might get broken are, by far, the most likely situations where we might actually learn and remember a lesson.

Check forgery protection using public-key cryptography

April 11, 2011 8:31 pm

Mom forwarded an email that was attempting to scam her in response to a Craigslist ad she placed for some furniture.

While I was thinking about this I realized we have the ability to essentially stop check forgery, specifically cashier's checks and money orders, but the principle would also apply to personal checks if we could develop a trusted lookup source for public keys.

Public-key cryptography allows you to publish a public key that can be used to either verify that you digitally signed something with your private key or to encrypt something which can only be decrypted with your private key.

The application would be as follows.

First, the banks put together a trusted database of public keys. This part is essential, as it must be possible to lookup a public key for any bank and you need to have a trusted source where at you do the lookup. A central database is mainly a convenience factor, you could simply have each bank publish their public key on their own site, but a more integrated solution is more likely to be used. This is not an insurmountable hurdle.

Second, when a bank creates a cashier's check it uses the data on the check (name, amount, date, etc.) and their private key to produce a digitally signed digital copy (or digitally signed hash) of the data which could be printed directly on the check as a QR code (or set of QR codes depending on size) [QR codes are those square barcodes].

Third, when someone attempts to cash the check the cashing bank scans the QR code(s) and verifies that the data matches what's printed on the check and also looks up the public key of the issuing bank and verifies that the signature is legitimate. In fact the actual printed data would be unnecessary at this point if it was encoded in the QR code, but I imagine we'd want to leave it on for the sake of the humans handling the check.

That's it. If implemented correctly and securely it would guarantee the authenticity of cashier's checks. The same system could be used for money orders as well. The other great thing about it is that individuals could verify a check the same way. They could scan the QR codes themselves with their fancy phones and then lookup the bank's public key (either from a trusted central repository or from the individual bank) and verify the authenticity of the check without any risk.

The biggest hurdles would really be getting a trusted repository set up and having banks securely store their private keys. There are easy extensions making this process even more feasible. You can use a master key to create sub-keys which could be used by individual branches. That would limit the risk if any individual branch's private key were compromised. With a central repository a compromised bank would revoke the published public key and flag it as compromised. Any outstanding checks would need to be brought back to the issuing bank to be reissued using a new key. A hassle, but it should be a world-shattering occurrence for a private key to be compromised.

This system is totally possible with today's technology. It would just be a matter of setting it up and getting banks to participate. Maybe I should go talk to some venture capitalists...

Google's Instant Search - Now Active (for some)

September 8, 2010 9:40 am

I did a search a moment ago and was surprised to discover that real-time searching is enabled for my account. This is apparently Google's big announcement today. It's kind of neat; no more hitting "Enter" or clicking "Search".

It doesn't seem to be active for everyone yet. I pulled up a different browser without logging in to my Google account and there was no real-time searching there.

The name they're using is "Instant Search" there's an option next to the search box to turn off instant search:

instant_search