Tool to catch plagiarism
If you copy somebody else's blog entry verbatim, credit the original author and link back to the original post.
Sometimes I'll google my own topics to learn more about what other people have to say about them. Doing that, I stumbled across some blatant plagiarism. That was annoying, but the cool thing was that it made me realize you could write a tool to search for blog plagiarism:
1.) Have some tool which reads through a blog feed. For each entry in the feed:
2.) Use a search engine to search for a large part of the entry's text. Perhaps search a paragraph at a time, since a single paragraph is more likely to be copied than the whole document. Since a whole paragraph is a pretty specific search, you'd expect only a few matches.
3.) Scan each search result (skipping the ones for the original post, of course!) for a hyperlink back to the original blog or for the author's name. If there is no such reference, the search result is likely plagiarizing the blog entry.
It seems like it should be pretty straightforward. It's mostly glue around an RSS reader and a search engine API. (Actually, it sounds so simple, I bet such a tool is already out there. I expect this is a common problem with schools and student papers.)
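To make the "glue" concrete, here is a rough Python sketch of steps 1 and 2. It is only an illustration under some assumptions: the feed is plain RSS 2.0, the `search` argument is a placeholder for whatever search engine API you end up using (any callable that takes a quoted phrase and returns result URLs), and the function names and the 20-word cutoff are made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape


def feed_entries(rss_xml):
    """Yield (link, description) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        body = item.findtext("description") or ""
        yield link, body


def paragraphs(entry_html, min_words=20):
    """Split an entry into plain-text paragraphs worth searching for.

    Short paragraphs are skipped (min_words is an arbitrary cutoff) because
    they are not specific enough to give only a few matches.
    """
    # Treat </p> and <br> as paragraph breaks, then strip the remaining tags.
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", "\n\n", entry_html)
    text = unescape(re.sub(r"<[^>]+>", " ", text))
    for chunk in re.split(r"\n\s*\n", text):
        words = chunk.split()
        if len(words) >= min_words:
            yield " ".join(words)


def candidates(rss_xml, search):
    """Yield (entry_link, result_url) pairs that still need step 3.

    `search` is a placeholder: any callable taking a quoted phrase and
    returning an iterable of result URLs.
    """
    for link, body in feed_entries(rss_xml):
        for para in paragraphs(body):
            for url in search('"' + para + '"'):
                if url != link:  # skip the original post
                    yield link, url
```

Passing the search function in keeps the sketch independent of any particular search API, and it also means the rest can be tried out with a fake search that returns canned URLs.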
As a sanity check, I tried this method out by hand with an example search using MSN Search on my post about 0xFeeFee sequence points. At the time of writing (8/20/05), there are only 3 different matches: my original post, this, and this. (For each match, there's actually a blog entry and an archived blog entry, so there were 6 total matches). When I pull up the source HTML for each of the results, I can see the 2nd one does not include any reference (either my name or blog URL) back to me, whereas the 3rd one does. So the tool could automatically flag the 2nd one as plagiarizing.
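The check I did by eye on the source HTML could be automated along these lines. This is a minimal sketch, not a finished detector: the function name `credits_author` is made up, the blog URL and author name in the usage lines are invented example values, and the regex only picks up simple quoted `href` attributes.

```python
import re


def credits_author(page_html, blog_url, author_name):
    """Return True if the page links back to the blog or names the author."""
    html = page_html.lower()
    base = blog_url.lower().rstrip("/")
    # Collect the targets of quoted href attributes (a simplification;
    # a real tool would probably use a proper HTML parser).
    hrefs = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', html)
    links_back = any(h.startswith(base) for h in hrefs)
    names_author = author_name.lower() in html
    return links_back or names_author


# Invented example pages: one with no reference, one with a link back.
copied = '<p>Some copied paragraph.</p><a href="http://other.example/">home</a>'
credited = '<p>Quoted text.</p><a href="http://blog.example/jane/archive/post.aspx">source</a>'

print(credits_author(copied, "http://blog.example/jane/", "Jane Blogger"))
print(credits_author(credited, "http://blog.example/jane/", "Jane Blogger"))
```

A result of False only makes the page a candidate worth looking at by hand; it does not prove anything on its own.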
Offhand, I don't know how to automate the search APIs. If I do end up writing such a tool, I'll be sure to post back. (Update: I wrote the tool and it's available here)
Comments
- Anonymous
August 20, 2005
Interesting, and sad. I hope plagiarism doesn't become a big problem for bloggers. - Anonymous
August 20, 2005
I wrote up some preliminary stuff and it's interesting and promising. Some issues I observe:
- It would be easy for this to generate false positives. For example, if both your blog and blog X quote article Y, blog X may not link to you. The tool needs to be smart.
- The RSS / atom feed is a great way of getting the input data, but it only works for recent entries.
My suspicion is that you could write a tool that could present you with some reasonable plagiarism candidates - but it will be difficult to make it more than 95% sure. - Anonymous
August 20, 2005
There is something called Simian (http://www.redhillconsulting.com.au/products/simian/) which is supposed to search for duplicate text. Maybe it can be adapted to this? - Anonymous
August 21, 2005
For kicks, I started writing a tool to use internet searches to automatically catch plagiarism. The first... - Anonymous
August 21, 2005
There is indeed a tool that does just this...
I can't for the life of me remember what it's called or where it is... I'll try and find it - Anonymous
August 21, 2005
Here it is... not quite how I remembered it, but it might help:
http://copyscape.com/ - Anonymous
August 22, 2005
I am a small-time webmaster with several webpages I maintain simply for my own satisfaction and for the enjoyment of other people like me. Recently, I found a corporate site that has, in my opinion, plagiarised my site. It was very frustrating. - Anonymous
August 22, 2005
Here's my sample code for a tool to catch blog plagiarism that I described earlier. In retrospect, it... - Anonymous
August 28, 2005
There is even such tool out there to catch blog's plagiarism. LOL! http://blogs.msdn.com/jmstall/archive/2005/08/21/plagiarism_tool.aspx Something...