Observing Thinking

Tuesday, August 27, 2013

Metadata = Data? You bet. (for Sept 8 PR column)

As I mentioned in last month’s column, the story of the NSA’s domestic surveillance made public by Edward Snowden has legs. In fact, if the story were an insect, it would be a millipede. Now before you send me a nasty correction, let the record show that I know that, by definition, an insect is limited to six legs, but “millipede” sounds so much better than “arachnid.” It would seem that in this brave new digital age there should be not only millipedes but mega-, giga-, tera- and even petapedes. No matter. Suffice it to say that this story shows no signs of ending well or soon.

As of Aug. 21, the latest twist in this thriller revealed that two years ago the FISA court strongly admonished the NSA for sweeping up domestic communications along with its foreign intelligence gathering. The crux of the issue was that, without a warrant, the NSA had no authority to spy on US citizens and, in fact, was violating the Fourth Amendment, which protects citizens from unreasonable searches.

I have spent some weeks researching the method that the NSA must have used to intercept US citizens’ phone calls, emails and other Internet transactions, and could only find the political and economic aspects: how they pressured Internet providers like Verizon and AT&T to “share” their data unbeknownst to US users. There was very little information about the actual techniques applied to the data once the NSA had it in their hands. So I decided to abandon the experiential approach and apply deduction instead. After all, I had taught the Database Management course in my career as a Computer Science professor, so why not put to use what I had learned? Here’s the way I think it went:

Once the NSA had all of this data safely stored on their collection of disks, they could make the first pass over the data to create their database. The three main functions of a database system are: Create, Update, and Interrogate. In the Create phase the raw data is usually indexed for rapid retrieval during the Update and Interrogate phases. Indexing is a fairly straightforward operation; if you’re of a certain age, you remember thumb-indexed dictionaries, which were designed to facilitate the Interrogate function. For example, if you needed the definition of “mendacious” you could start your search immediately in the “M” section of the dictionary thanks to the handy thumb indentations, rather than begin on page 1 and search sequentially from there. Techniques similar to this are embodied in computer programs whose job it is to update and interrogate large databases: similar in kind but not in degree. These programs not only allow for multiple indexes as links to the database but are orders of magnitude faster than manual methods.
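For the programmers in the audience, here is a minimal sketch of the thumb-index idea in Python. The word list and the function names are my own inventions for illustration, and real database indexes are far more elaborate; the point is only that jumping to the right section means looking at fewer entries than starting from page 1.

```python
from collections import defaultdict

# A tiny, made-up word list standing in for a dictionary (already in order).
WORDS = ["aardvark", "bomb", "mendacious", "mendicant", "zygote"]

def sequential_search(words, target):
    """Start on 'page 1' and return how many entries we examined, or None."""
    steps = 0
    for word in words:
        steps += 1
        if word == target:
            return steps
    return None

def build_thumb_index(words):
    """Group words by first letter, like the thumb notches in a dictionary."""
    index = defaultdict(list)
    for word in words:
        index[word[0]].append(word)
    return index

def thumb_search(index, target):
    """Jump straight to the target's letter section and scan only that section."""
    steps = 0
    for word in index.get(target[0], []):
        steps += 1
        if word == target:
            return steps
    return None

thumb_index = build_thumb_index(WORDS)
print("sequential steps:", sequential_search(WORDS, "mendacious"))
print("thumb-index steps:", thumb_search(thumb_index, "mendacious"))
```

With only five words the savings are trivial, but the gap widens as the dictionary (or the database) grows.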

For example, if I am the program looking at one of your emails, I can record the time and date it was sent, your email address and the recipient’s, as well as any keywords that have been deemed important, like “bomb”, “Egypt”, “Syria”, “China”... you get the idea. Next, I determine the location in disk memory where this email will be stored, but before I store it I make a note in the form of a list that associates each of the keywords with that disk location. This process is repeated for all of the emails in the database, and when it’s finished we have created a table of keywords and the disk locations of the emails that contain each word:

Keyword     Location
aardvark    636542
bomb        124679, 001489, 789325
...         ...
zygote      987654, 123321
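Here is a rough sketch, in Python, of how a Create pass might build such a table. Everything in it is an assumption on my part: the watch list, the sample emails, the disk locations and the function name are made up for illustration, and I am not claiming this is how the NSA actually does it.

```python
from collections import defaultdict

# Hypothetical watch list; any real list of keywords is not public.
KEYWORDS = {"bomb", "egypt", "syria", "china"}

def build_keyword_table(emails):
    """Map each watched keyword to the disk locations of emails containing it.

    `emails` is a dict of {disk_location: email_text}.
    """
    table = defaultdict(list)
    for location, text in emails.items():
        # Crude word splitting: strip common punctuation and ignore case.
        words = {w.strip(".,!?;:\"'()").lower() for w in text.split()}
        for keyword in sorted(words & KEYWORDS):
            table[keyword].append(location)
    return dict(table)

# Made-up emails keyed by made-up disk locations.
emails = {
    "124679": "The bath bomb I ordered finally arrived.",
    "001489": "That movie was a bomb, but the scenes shot in Egypt were lovely.",
    "636542": "Lunch on Friday?",
}

table = build_keyword_table(emails)
for keyword in sorted(table):
    print(keyword, ", ".join(table[keyword]))
```

Computer scientists call this kind of table an inverted index. Notice that the program has to read the full text of every email in order to build it.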

Now imagine that I’m the Interrogate program and my human NSA agent wants to look at all emails that contain the word “bomb”. All I have to do to make him happy is consult my table of associations between keywords and disk locations, go to each location (124679, 001489, 789325) and display the full email located there.
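The Interrogate step is even simpler to sketch. In the Python below, the keyword table and the “disk” are ordinary dictionaries that I filled in by hand; the email texts and the function name are hypothetical stand-ins, not anything taken from a real system.

```python
# Hand-made keyword table, using the locations from the example table.
KEYWORD_TABLE = {
    "aardvark": ["636542"],
    "bomb": ["124679", "001489", "789325"],
}

# A pretend disk: location -> full email text (all invented).
DISK = {
    "636542": "Did you see the aardvark at the zoo?",
    "124679": "The bath bomb I ordered finally arrived.",
    "001489": "That movie was a bomb.",
    "789325": "Our team will bomb the exam without more study.",
}

def interrogate(keyword, table, disk):
    """Return the full text of every email indexed under `keyword`."""
    locations = table.get(keyword.lower(), [])
    return [disk[location] for location in locations]

for email in interrogate("bomb", KEYWORD_TABLE, DISK):
    print(email)
```

One table lookup and three disk reads, and the agent has the complete text of all three emails in front of him, with no need to scan the rest of the database.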

By this time, dear reader, you may have surmised that these keywords, which link to and allow rapid access to individual emails, are the metadata the NSA originally claimed to be outside the purview of the Fourth Amendment because they are not the actual data itself. If you believe that, I have a lottery prize for you to claim.
