Data Mining, Spectral Analysis, and All that Jazz

Friday, August 29 2003 @ 01:45 AM EDT

Contributed by: PJ

We're not born knowing what spectral analysis is. So when SCO said that spectral analysis is one of the methods they used to find "infringing" code, I had no idea what they were talking about. When Sontag compared it to finding a needle in a pile of needles, I figured it wasn't much use. And it turns out, on further investigation, that my intuitive conclusion may be about right, at least when it comes to using it for software code data mining for infringing code in this case.

An alert reader noticed something interesting. One of Canopy Group's companies is called DataCrystal. Could that be at least one of the three groups SCO hired to try to sort through the code of UNIX SystemV and the Linux kernel? DataCrystal does "advanced pattern recognition" and "AI systems". They actually claim to do a great deal more besides. One of the things listed on their what-we-do page is data mining. Presumably, that's what SCO wanted to do. And a look at their About page indicates that if you are the RIAA, you probably would want to have a company like DataCrystal to hunt down pirates for you. Another reader noticed this page about a DataCrystal, and he wondered if it might be the same company.

It isn't, because this DataCrystal is the name of a project at USC, not a company. While I don't know if the Canopy Group company DataCrystal was hired by SCO, or whether there is any connection between the company and the project, it did make me start to wonder about the field in general. If you really wanted to know if two piles of code had identical or similar code, can data mining find out? And would matches be reliable for use in the way SCO apparently is using them? Judging by the SCOForum demo, we might think no. And we might be right.

I asked a Groklaw resource person, a man who worked for over a decade doing basic and exploratory research for the US DoD and the Canadian Ministry of Defence on topics related to secure communications and signals intelligence, including cryptology, statistical processing of natural language, signal processing, and computational learning, if he'd be willing to explain it in general and understandable terms, so we can follow along. Very likely this subject is going to be a very significant part of the case when it goes to trial. Here is what he explained to me:

"Data mining is looking for patterns or similarities in large quantities of information. Google is a good example of data mining-on-demand where the pattern is supplied by the user and the large quantity of information is the entire set of webpages on the internet. But data mining in general is potentially much broader. For example, a typical data-mining task might be to take a sample document and look for other documents in a database that might be similar to it. But even beyond that, data mining can be applied to other kinds of data -- pictures, for example, or sound recordings.

"There are lots of different ways to approach problems like this. Beyond the most elementary, what all the techniques have in common is that they rely on mathematical models and transformations of the data. Part of the reason is efficiency, since turning the problem into math usually means there's a computationally clever way to do it. Another part of the reason is that, by transforming the problem into math, you make it possible to find and grade a continuum of approximate matches -- in short, to find ranges of similarities rather than just identities. Note very well that 'similarity' here is completely dependent on the particular flavor of math you've chosen as your technique. This is extremely important.

"OK, so you've taken your document or picture or whatever, and you've mined your database for similar items. Those items will be graded for similarity to your original, just as some search engines will rate their returned items in terms of probable pertinence. The most sophisticated and respectable data-mining systems will be using grades based on probabilities. This is because the underlying math will be using probability models. Many times the grade will reflect not merely the strength of a match in terms of probability, but also the likelihood that such a match would be found at random searching any old data at all. This also is extremely important, since 'any old data at all' can be subject to a wide range of interpretations. (This could pertinent in the SCO case, since, if data-mining techniques are used, it's a reasonable question whether any contamination discovered this way is real, or whether it's spurious, i.e., capable of being found to the same degree in other, unrelated data.)

"Now the DataCrystal webpage consists mostly of a laundry list of any and all of the subjects ever associated with data mining, artificial intelligence, knowledge discovery, or machine learning. But the .pdf white papers all focus on using data-mining techniques for indexing and retrieving digital video and audio. What's more, they're offering not just indexing and retrieval services, but also housing, protecting, and distributing the data itself.

"It outlines an enhanced technique for expanding data-mining coverage. It's a technique for building patterns out of patterns and data mining on the derived metapatterns in turn."


Not being a rocket scientist, I wanted to be sure I'd understood, so I wrote back and asked these followup questions, and got this reply:

"Q: I have two questions to follow up:
" 1. . .the results would depend on how you programmed the software? In other words...it can look for similarities, but it can't evaluate them?"

"ANS: Absolutely correct.

"Q:..there might be in actuality no common code at all?"

"You know how Google sometimes matches all the words in your query, but not necessarily conjointly or in the same order?

"In the case of computer code, especially code written in C expressing similar or common algorithms, it would be astounding if there weren't pattern similarities at some level. If nothing else such things are enforced by the design of the language and commonly-held notions about good coding style."

"Q: ...it simply would have to be the case that some of the code is close enough that they might have a case?"

"ANS: Just the contrary. As with the 1st slide example, the ancestry of that memory-management code is known to virtually anybody who's studied C from Kernighan and Ritchie's book. A similarity like that would stand out like Devil's Tower, but what it indicates is exactly the opposite of what they contend: it shows that everybody knows the pattern."

"Q: And can they program the math to increase "matches"? Pls. explain a bit more this part."

"ANS: Here's an example. Suppose you came up with a hitherto-unknown page of blank verse. The question is, was it written by Shakespeare or not?

"Data mining your way through that problem, you'd get one level of certainty if your database contained the Bible, Goethe, Racine, Pushkin, and the New York Times. You'd get a different level of certainty if your database were confined to Elizabethan dramatists. The scores for putative Shakespeare against the mixed database would probably be huge just for matching any English. The scores against Elizabethan dramatists would probably be quite a bit weaker, but clearly more conclusive. The mixed-database test -- the one with the Bible, Goethe, etc. -- will probably say 'Shakespeare indeed!' but it's expressing the idea that 'if it's English it's Shakespeare.'

"On the other hand, the Elizabethan dramatist test might say yes, might say no, but the answer will be based on such things as a small number of very subtle differences between, say, Shakespeare's and Marlowe's vocabulary. It expresses perhaps the idea that 'in any 1000-line chunk of Shakespeare and any 1000-line chunk of Marlowe, Shakespeare is likely to use the word 'ope' once and Marlowe not at all. This example doesn't use 'ope' at all therefore it's probably Marlowe.

"You can see it's still a matter of interpretation and probability, but the second test is simply more credible on grounds that are external to the data-mining method itself.

"Here's another point of view. How does a data-mining search for SVr4 code look if you run it against all C programs? In all likelihood you're going to find some matches. Are the matches against Linux actually any stronger than matches against an arbitrary body of C code? Against other Unix-like kernels? etc.

"These are interpretive issues, but there are statistical grounds for deciding them, and speaking strictly for myself, I seriously doubt they've been fielded satisfactorily. For my money you couldn't even start taking the matter seriously unless exactly the same tests were run against every body of  other kernel code like all the BSDs, and a chunk of the SVr4 kernel against the rest of that same kernel. And even then, you've only generated the raw information to start the business of verifying and refining the procedure."

"Q: Also, what is spectral analysis? Is that what this is?"

"No. In general, spectral analysis refers to breaking things down into component frequencies -- sort of like how a prism breaks white light into colors, and so on.

"In this case it refers to using the periodicities of the individual characters of program text as frequencies to look for a very specific set of 'colors' associated with a particular swatch of program code. It's not determinative either. It may also refer to a kind of computational trick using spectral-based techniques to look for certain kinds of approximate matches very quickly."


So, there you have it. At least now we know in general what they are talking about. As the case goes forward, and more is revealed, no doubt it'll be interesting to meaningfully follow along.

His analogy to Google made it all come clearer to me. On top of all that he wrote, I know with Google, input affects output. And input means humans, imperfect humans. I certainly know that I get different results from Google if I plug in the identical search terms, but in a different order, for example. So I totally get how results could be skewed by what you tell the software to do. For example, I get different results if I search for "Dave Farber" and IP than I do if I search for IP and "Dave Farber", and it's different still if I search for just IP or just "Dave Farber" or just Farber or just Farber and IP. And that's using the same pile of data. Input affects output.

Obviously they would argue that their methods are so refined, blah blah. But that human element can't be removed, because humans write the software, no matter how sophisticated. So how reliable are the matches? You use Google. What do you think? Doesn't a human at some point have to interpret the value of the results?

"A continuum of approximate matches" does not infringement prove, on its face. As he says, it's an interpretive issue. And data mining seems to be a better match with something like matching amino acid strings than figuring out if someone stole somebody's code, which requires knowing who has or doesn't have a valid copyright, which way matching code travelled, who had the code first, etc.

If I've understood what my friend has written, it means that if SCO swapped out Linux and searched Windows 2000 code instead, it'd likely find instances that looked like "infringing code" also. That's the same as saying that so far, they are holding maybe nothing. It all reinforces in my mind that, once again, nothing has been proven to date by their claims of similarity, derivative or obfuscated code matches, and nothing can be proven using data mining techniques, until this case goes to trial and the experts speak, followed by a decision by a judge.

If you are interested, here is a white paper, "Text Mining -- Automated Analysis of Natural Language Texts" that explains the process of searching just for simple text, and while it does the explaining, it also shows just how much human input goes into structuring your search before you begin the search and why the results still may not be what you want. It is hard to see how such techniques could answer the question: "Is this infringing code?" At best, it could show you where to begin to investigate. And here are the DataCrystal project's white papers.

Oh, one other thing I found out in my investigation. Guess where most of the cutting-edge brains working on such data-mining techniques work? . . . No, really. Guess. . .

That's right: at IBM.

95 comments



http://www.groklaw.net/article.php?story=271