Should we trust Google regarding WiFi Scandal? - Paladine's Blog — LiveJournal
Should we trust Google regarding WiFi Scandal?
Google have claimed to the press and media that their latest privacy scandal - the interception of Internet communications while their Streetview cars were sniffing out WiFi hot spots - was an "accident".

They have stated that the code was being worked on for a different project and somehow managed to get inserted into the Streetview project - and frankly that doesn't wash.

Having worked on large IT projects for 15 years, I have a strong understanding of the design, development, testing and deployment cycles of such projects, so let me explain a little about how it works.

1.  The Design Phase
As the title suggests, this phase is where the project is originally defined and designed.  Normally at the beginning of this phase there would be a very high level concept design which would not include any "code" as such - its purpose would be to give management and executives a human readable outline of the design principles and purpose of the project.

Once this has been signed off by management and a project leader/manager has taken control, that design concept will be fleshed out to make it ready for the engineers - this would result in documentation still at quite a high level (human readable) with perhaps some "pseudo code" but certainly nothing more.

The output from this phase would consist of a lot of reference documents, a technical glossary, a project plan and a lot of documents defining technical functionality and specifications - these would then become the core knowledge resources for the entire project and would be used by developers, testers and even management, throughout.

2.  The Development Phase
Nothing too complex in describing this phase - it is what it says on the tin.  Using the design references and technical specifications, the engineers would develop the code base for the project.  They liaise with the designers frequently, and once they have some code it goes off for testing and debugging.

3.  The Testing Phase
Testing and debugging will be heavily reliant on the technical specifications and various other documents from higher up the chain.  Test environments would be set up to mimic the real world, and extensive testing of every single piece of code is carried out.  This is one of the most important phases in any IT project and it lasts a long time.  Every single byte of data which is produced by the tests is inspected to ensure that the code is working as planned.  It never does, at least not in the early phases of a project, so there is a lot of interaction between developers and testers, and again a lot of interaction between developers and designers.

4.  The Deployment Phase
In essence once a project has been thoroughly tested and is seen as stable it will be deployed into the real world - this doesn't mean that the three previous groups become obsolete - in fact they would continue to redesign, redevelop and retest in order to add new features, remove features which are not needed and deal with bugs or unexpected behaviour which was not picked up in the labs.  And believe me, these -always- manifest - I have yet to work on a large project which works as desired first time round, it simply doesn't happen.  The project manager has to deal with change requests, bugs, resource issues, efficiency issues and a whole bunch of other things.

So the question is: how does a piece of code "intended" for another project entirely manage to find its way into the project without being noticed?  The short answer is that it doesn't - it simply is not possible, because of the very granular method in which projects are developed.

At the very worst it would have been picked up in Phase 3 (Testing), as the data coming back from the test environments would include all this "accidental" data and would be picked up by the people doing the testing.  For it to be "rogue" code, one would assume there would be no technical specifications for it - which would immediately ring alarm bells with the testers when they find they have all this data which is not defined.
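
To make that concrete, here is a minimal sketch of the kind of check testers effectively perform: every field in the output is compared against what the technical specification defines, and anything undefined is flagged.  The field names here are invented purely for illustration.

```python
# Hypothetical sketch: validating test output against the project's
# technical specification. Any field the spec does not define is flagged.
# All field names are invented for illustration.

SPEC_FIELDS = {"ssid", "mac_address", "channel", "signal_strength", "gps"}

def undefined_fields(record: dict) -> set:
    """Return any fields in a test record that the spec does not define."""
    return set(record) - SPEC_FIELDS

# A test record containing payload data the specification never mentions:
record = {
    "ssid": "TestAP",
    "mac_address": "00:11:22:33:44:55",
    "channel": 6,
    "payload": b"GET /index.html HTTP/1.1",  # undefined - alarm bells
}

extras = undefined_fields(record)
if extras:
    print(f"ALARM: output contains undefined fields: {sorted(extras)}")
```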

Even if it was missed during the testing phase (which is incredibly unlikely) it would certainly be noticed in the data coming back during the early stages of deployment - which is always examined thoroughly - you simply cannot fail to notice all this incoming data containing the contents of Internet communications.

Furthermore, one has to assume that the size of this data (considering it was collected for over 3 years) would be significant - probably hundreds of terabytes.  That all has to be stored somewhere, and believe me when I say Database and System Administrators know their systems very well indeed.  It is their job to know what is in their systems and why it is there - they need to know this to keep on top of resources and to manage access control and backups.  You can't store all this extra data accidentally; it takes physical space, money and real man hours to manage it.
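
A back-of-envelope calculation shows why a volume like this cannot slip past administrators unnoticed.  Every figure below is an assumption for illustration only - not Google's actual fleet size or capture rate:

```python
# Back-of-envelope estimate of the storage involved. Every figure here is
# an assumption for illustration, not a real Google number.

cars = 250                 # assumed Streetview fleet size
hours_per_day = 8          # assumed driving hours per car per day
days = 3 * 365             # roughly three years of collection
mb_per_car_hour = 50       # assumed average WiFi capture rate per car

total_mb = cars * hours_per_day * days * mb_per_car_hour
total_tb = total_mb / 1_000_000
print(f"~{total_tb:.0f} TB to store, back up and manage")
```

Even with conservative assumptions, the result is on the order of a hundred terabytes - not something that sits in a data centre unaccounted for.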

So do I trust Google when they say it was an accident?  Absolutely not - they knew they had the data, they knew where and what that data was, and they stockpiled it for 3 years - and it is likely they would have continued to do so had Germany not demanded to know what data they were collecting.

Google may well be able to pull the wool over the eyes of regulators, press, media and the general public - but anyone who has worked professionally on large IT projects knows full well that this was no accident - it just doesn't happen that way.


4 comments or Leave a comment
From: (Anonymous) Date: May 16th, 2010 08:37 pm (UTC)

Google and WiFi Payload data


Very insightful. However, I'm not 100% convinced collection of payload data was purposeful here (as opposed to their purposeful collection of SSID and MAC address, which I DO find problematic).

A.) It appears that Google plugged in code from another project intended to capture SSID and MAC address, and that it inadvertently captured payload as well. I can see someone going over this code mistakenly thinking it will only capture OSI layers 1-3 plus the SSID, and not realizing it is also catching layers 4-7.
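
(A simplified sketch of the distinction being made here - this is not a real 802.11 parser, and the frame layout is invented for illustration: code that extracts only the header fields discards the payload, while code that stores whole raw frames keeps layers 4-7 along for the ride.)

```python
# Simplified sketch: extracting only header fields versus storing whole
# raw frames. The frame layout is invented for illustration - this is
# not a real 802.11 parser.

def extract_header_fields(frame: bytes) -> dict:
    """Pull out only the addressing fields a mapping project needs."""
    return {
        "dest_mac": frame[0:6].hex(":"),
        "src_mac": frame[6:12].hex(":"),
    }

def store_raw_frame(frame: bytes) -> bytes:
    """Keep the entire frame - header *and* payload."""
    return frame

header = bytes.fromhex("001122334455665544332211")
frame = header + b"GET /private/mail HTTP/1.1"  # payload rides along

print(extract_header_fields(frame))          # no payload in sight
print(b"private" in store_raw_frame(frame))  # True - payload retained
```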

B.) Testing: I am a bit surprised that Google did not catch in the testing phase that they were capturing payload from unencrypted WiFi routers, but again, I don't see it as inconceivable. It could have been, for example, that the routers they were testing on were only broadcasting SSID or minimal payload data. It could be that the filters they were using to capture packets were misconfigured in the deployment phase to catch all 7 layers, as opposed to only layers 1-3 at the testing phase.
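
(The misconfigured-filter scenario can be pictured in terms of libpcap's snapshot length, or "snaplen": a capture limited to the first few header bytes keeps no payload, while an unlimited one keeps everything. The header size and packet contents below are invented for illustration.)

```python
# Sketch of the misconfigured-filter hypothesis, modelled on libpcap's
# "snaplen": each captured packet is truncated to a snapshot length.
# Header size and packet contents are invented for illustration.

HEADER_LEN = 24  # assumed header size for this sketch

def capture(packet: bytes, snaplen: int) -> bytes:
    """Keep only the first `snaplen` bytes of a captured packet."""
    return packet[:snaplen]

packet = b"H" * HEADER_LEN + b"secret payload data"

headers_only = capture(packet, snaplen=HEADER_LEN)  # intended lab config
everything = capture(packet, snaplen=65535)         # deployed misconfig

print(b"secret" in headers_only)  # False - payload truncated away
print(b"secret" in everything)    # True - full payload captured
```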

C.) After Deployment: how is it Google did not notice they had captured this data until the German DPA came a' knocking? I don't know. But if they did not know it was there, perhaps they did not know to look for it. It could be, for example, that collected data was merely searched for SSID and MAC, and was not examined from top to bottom. As you note, there would have been a fair amount of data involved, and perhaps they did not wish to go through it bit by bit.

D.) most importantly, in my view, I cannot think of any benefit Google could get from such data. We're talking of a random sampling of a few seconds of payload data gathered from random computers while passing by. Certainly, the privacy invasion here is great, and we should all be angry at Google for letting this happen, but I'm not sure what great benefit there could be that may have prompted them to do this deliberately. In that sense, I see this as quite different from the Google Buzz launch, or the collection of SSIDs and MAC addresses.

Just my thoughts. Highest regards -
From: _paladine_ Date: May 16th, 2010 10:09 pm (UTC)

Re: Google and WiFi Payload data

Thanks for the comment - let me answer each point in turn.

On Point A - Google have categorically stated that this code should not have been added to Streetview - which leads one to believe that it was in addition to the code used for capturing SSID and location data, not part of it.

On Point B - in all my years working in this field I have never seen a situation where testing does not rely on both the technical specifications and inspection of all data produced - this is essential to show that the project is working as designed/desired. As you probably know, technical specifications on such large projects are very detailed, so having data which is not defined in the specifications would raise alarms. Granted, it could be that their testing environment used encryption on their APs, but that still doesn't excuse the data being picked up during deployment (see C).

On point C - testing of data from live systems is, in my experience, even more thorough than lab-based tests. It has to be, because once the system is live, any issue which has not been picked up during testing has additional costs which could threaten the project - especially if it is a major issue which could lead to temporary suspension of deployment.

On point D - the packets captured would include routing data, and so the IP addresses of the individuals using the device. These could very easily be correlated with existing data for those IP addresses and then used for behavioural based products - so there was additional value in this data. I am not saying that is why it was collected, just that there is value to it.
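
As a purely hypothetical sketch of that correlation (all data, field names and IP addresses - taken from documentation ranges - are invented):

```python
# Hypothetical sketch of the correlation described above: IP addresses
# seen in captured packets joined against an existing per-IP profile
# store. All data, field names and IPs (documentation ranges) are invented.

existing_profiles = {
    "203.0.113.7": {"ads_shown": 42, "search_terms": ["hiking boots"]},
}

captured_packets = [
    {"src_ip": "203.0.113.7", "payload": "GET /forum/thread/99"},
    {"src_ip": "198.51.100.2", "payload": "POST /login"},  # no profile
]

# Attach intercepted payloads to any IP that is already profiled.
for pkt in captured_packets:
    profile = existing_profiles.get(pkt["src_ip"])
    if profile is not None:
        profile.setdefault("intercepted", []).append(pkt["payload"])

print(existing_profiles["203.0.113.7"]["intercepted"])
```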

But even putting all that aside, it still does not excuse the fact that this data was stockpiled for over 3 years. It should have been noticed (it is beyond the realms of belief that it wasn't, in my opinion) and, unless there was a desire to use it, it should have been destroyed. It wasn't, and probably never would have been, had the German government not forced them to disclose it in the first place.

So the point of the post was to give a high level illustration of how these types of projects are managed and how that granular management makes it incredibly unlikely that Google were not aware the data existed.

Whether or not the code was inserted in error - someone knew about it (arguably several people).
Whether or not the data was used, it was stockpiled and therefore several people should have been aware of it in order to manage it.

So to say it was an accident and they didn't know, simply doesn't wash.

Again, thanks for the response.
From: (Anonymous) Date: May 16th, 2010 11:12 pm (UTC)

Re: Google and WiFi Payload data

Fair enough. I don't think we dramatically disagree here, now that I've read your response.

You're saying it's incredibly unlikely they didn't know. I'm saying I think there's a small chance, but only if they really messed up on safeguards/processing, etc. Either way, I think they're in serious error, and potentially in criminal error.

To be honest, I'm personally a little shocked they say they were knowingly collecting SSID/MAC addresses and that, apparently, DPAs were not informed.

Thanks for the response, and best regards,
From: (Anonymous) Date: June 10th, 2010 12:37 pm (UTC)

Data suddenly appearing

I think the important question here is: did Google know that extra data was in the content being captured? As a developer who has for many years reverse engineered file formats and represented them in an entirely different 'common' format, I find it very difficult to believe that this data existed and was being produced undetected late in the development stage, or was missed in the testing stage.

Given that Google was collecting enormous amounts of data with their street cars, it would be intuitive to suggest that this data was looked at prior to application release and tweaked to ensure that it was written to disk during capture using the most economical storage method available. It would be foolish and incomprehensible to suggest that Google allowed the saved file format to contain 'bloated' raw data containing this extra 'unwanted' data that nobody knew about and that was ignored. If I went about mapping an entire country, I would ensure that the data collection process was lean and mean for the purpose intended, with no unwanted 'side effects' taking up valuable disk space. Anything other than lean and mean would be a gross error on my part.
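
(A lean record of the kind described might look like the following sketch: a fixed-size binary record holding only the fields a mapping project needs, so payload data has nowhere to hide. The field choice and sizes are illustrative.)

```python
# Sketch of the "lean and mean" record described above: a fixed-size
# binary record holding only the fields a mapping project needs.
# Field choice and sizes are illustrative.

import struct

# 6-byte MAC, 32-byte SSID, signed RSSI in dBm, latitude/longitude doubles
RECORD = struct.Struct("!6s32sb2d")

def pack_observation(mac: bytes, ssid: str, rssi_dbm: int,
                     lat: float, lon: float) -> bytes:
    """Pack one access-point observation; struct NUL-pads the SSID."""
    return RECORD.pack(mac, ssid.encode(), rssi_dbm, lat, lon)

rec = pack_observation(b"\x00\x11\x22\x33\x44\x55", "CoffeeShopAP",
                       -60, 51.5074, -0.1278)

print(RECORD.size)  # every observation is exactly this many bytes
```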