If you’re an online publisher of any kind, Jonathan Bailey has some sobering news for you: “If you ping out an RSS feed, it’s almost guaranteed someone is trying to scrape and steal your content.”
Bailey, a writer and Web developer, is the founder of PlagiarismToday.com, a blog dedicated to the all-too-common practice of taking copyrighted material and using it across the Web without attribution. Almost anything is up for grabs, whether it’s an article, a blog post, a photograph or a unique graphic element of your site. And the practice could be costing you readers, reputation or search engine ranking.
Bailey -- who points out that he is not a lawyer -- became an online plagiarism expert after his own run-in with content theft. He discovered that people were plagiarizing essays, poems and short stories from his fiction Web site. Now, he consults with publishers of all kinds who want to protect themselves from plagiarists.
We asked him to share his tips for understanding how plagiarists work and how to find them and stop them when they’ve stolen your materials:
-> Tip #1. Material that is typically plagiarized
“Plagiarists tend to favor anything that’s easy to steal, and the easier it is, the more likely it is to get stolen,” says Bailey.
This means text and images are often plagiarized, because it’s easy to strip the real attribution from those pieces. Video and audio files aren’t often plagiarized because of the difficulty of removing the identifying characteristics of those files.
That said, here are the types of text and image files that plagiarists often target:
- Articles. The shorter the work, the more likely it is to be plagiarized.
- Blog posts. The most commonly stolen pieces, but any articles sent out in an RSS feed are easy to scrape.
- Marketing copy.
- Images and photographs, especially from fine arts or professional photography sites.
- Web-design elements such as custom buttons, banners and icons that give your site a unique look.
- Templates or samples, such as crochet and knitting patterns.
-> Tip #2. Different types of plagiarists
Online plagiarists tend to fall into three categories, says Bailey, with characteristics that affect how to find them and stop them:
- The professional plagiarist. These individuals plagiarize works to further their own career. “Getting caught is the worst thing imaginable for them -- it’s the end of the gravy train.”
- The profiteer. These plagiarists take content solely to make an immediate buck, either by selling content outright or, more commonly, as fodder for spam blogs. Spam bloggers scrape whatever content they can from other sites to generate artificially high search engine rankings and translate that into advertising dollars. “They don’t even care if anyone believes the work is really theirs. They’re just trying to trick the search engines.”
- The confused. These are people who plagiarize information for their blogs and Web sites for personal reasons -- such as to impress people -- not to make money or to further their careers. Bailey says they tend to be teenagers or college students.
-> Tip #3. How plagiarists do it
Grabbing content from around the Web is easy, with techniques ranging from basic cut-and-paste to more sophisticated automated schemes and attempts to alter stolen content.
- RSS scraping is the No. 1 way plagiarists steal content. Automated software similar to RSS readers allows plagiarists to sign up for RSS feeds from any number of blogs or Web sites and then scrape all new posts from feeds and post them to their own site. One RSS scraper can set up thousands of sites in an hour, says Bailey, scraping relevant content from legitimate sources.
- Simple copying and pasting also is common, with plagiarists selecting text and pasting it onto their own site or right-clicking images to save them to their own computers.
- Image hotlinking allows plagiarists to steal images from another Web site, but have that image remain on the owner’s server. “That is particularly bad because it’s not only plagiarism, it can cost you bandwidth and money every time someone hits that file from another site.”
- Synonymizing is a variation of RSS scraping that searches the content and replaces a handful of key words with synonyms to hide the fact that it’s a plagiarized piece. For example, a blog post containing the sentence, “The cat went into the house,” might become, “The feline went into the home.”
The problem is, the software can’t understand all the nuances of language and syntax and creates alternative sentences that often don’t make much sense. That’s why synonymizing is mainly used by spam bloggers who don’t care if the text is readable, just that it has keywords that get indexed by search engines.
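To see why synonymized text so often reads badly, here is a minimal Python sketch of blind word-for-word substitution. The synonym table and the function name are hypothetical and far smaller than anything a real tool would use; the point is only that a lookup table has no sense of context.

```python
import re

# Hypothetical synonym table for illustration only.
SYNONYMS = {"cat": "feline", "house": "home", "right": "correct"}


def synonymize(text: str) -> str:
    """Replace every word found in SYNONYMS, ignoring context entirely."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = SYNONYMS.get(word.lower())
        if replacement is None:
            return word
        # Keep the capitalization of the original word.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


print(synonymize("The cat went into the house."))
print(synonymize("Turn right at the house."))
```

The first sentence comes out as "The feline went into the home." The second becomes "Turn correct at the home," because the table cannot tell a direction from a judgment.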
-> Tip #4. How to detect plagiarism
Bailey’s advice for finding plagiarists: “Google is your friend.” Because most plagiarists want search engines or people to find their sites, you can use that need for attention against them:
- A basic technique is to choose a unique phrase from the text of your articles and blog posts, type it into Google and see what comes up. Or, instead of manually conducting keyword searches for your content, use the Google Alerts tool, which will automatically email you whenever a new page with a matching phrase comes up.
- Coming up with new keyword alerts for blog posts or other frequently changing content is challenging, so in these cases consider creating a simple digital fingerprint for all your articles or posts. This can be an uncommon word or combination of letters and numbers that you affix in the footer of your RSS feed, so that all new posts are tagged with it. This way, you can set up a single Google alert to find your digital fingerprint.
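The fingerprint idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Bailey's own method: the function names, the prefix, and the choice to append the token as the last line of each post are all hypothetical.

```python
import secrets


def make_fingerprint(prefix: str = "fp") -> str:
    """Build a token unlikely to appear anywhere else on the Web.

    Generate it once and reuse the same value in every post.
    """
    return f"{prefix}{secrets.token_hex(6)}"  # prefix + 12 random hex characters


def tag_post(body: str, fingerprint: str) -> str:
    """Append the fingerprint as the final line of a feed item."""
    return f"{body.rstrip()}\n\n{fingerprint}"


def alert_query(fingerprint: str) -> str:
    """Wrap the fingerprint in quotes so a search matches it exactly."""
    return f'"{fingerprint}"'


fingerprint = make_fingerprint("xq")
print(tag_post("Today's post text.", fingerprint))
print(alert_query(fingerprint))
```

Because the token is random, any page that shows up for the quoted query is very likely carrying a copy of your feed.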
Finding plagiarized images is tougher, because image searching is less effective than text searching. A publisher’s best bet is to prevent image plagiarism, rather than try to find it when it happens.
- Watermarks and image overlays are the best way to prevent plagiarists from stealing your images. Some vendors offer high-end, invisible watermarking services that also track the appearance of copyrighted images on the Web, but these services are expensive and are likely to be used only by professional photographers or stock photo sites.
- For sites that can’t use watermarks on their images, because they’ll interfere with how the photos are used on the Web site, publishers can try saving all images with a unique file name, something that no one else is likely to use. Because plagiarists don’t typically change file names on the images they steal, you can do a Google image search for the key word or phrase in your file names. “It’s not ideal by any stretch of the imagination, but it can work.”
- Don’t be tempted to use a script that disables right-clicking to prevent people from downloading your images. Bailey says the technique doesn’t really stop theft, since people intent on taking an image can find other ways to do it. You’ll also infuriate Web users who rely on the right-click feature to navigate the Web.
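For the unique-file-name approach described above, a short Python sketch along these lines could rename a folder of images in one pass. The site tag, the function names and the list of extensions are assumptions made for illustration; pick a tag that no one else is likely to use.

```python
from pathlib import Path


def unique_name(original: str, site_tag: str) -> str:
    """Prefix a file name with a tag that is easy to search for later."""
    path = Path(original)
    return f"{site_tag}-{path.stem}{path.suffix.lower()}"


def rename_images(folder: str, site_tag: str,
                  extensions=(".jpg", ".jpeg", ".png", ".gif")) -> list:
    """Rename every untagged image in a folder and return the new names."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        already_tagged = path.name.startswith(site_tag + "-")
        if path.is_file() and path.suffix.lower() in extensions and not already_tagged:
            target = path.with_name(unique_name(path.name, site_tag))
            path.rename(target)
            renamed.append(target.name)
    return renamed


print(unique_name("Sunset.JPG", "zqx7photos"))
```

Run it on a copy of the folder first, and remember that pages linking to the old file names need updating.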
No matter what kind of content you publish, Bailey recommends that all sites enlist loyal visitors in the effort to find plagiarists. Encourage your readers or subscribers to contact you any time they see something online that looks like it was taken without permission. “They have eyes and ears all over the Internet, and they visit sites similar to yours all the time, so they’re likely to be the first to see any kind of copyright infringement.”
-> Tip #5. How to get plagiarists to remove your materials
Once you’ve found another site that’s trying to pass off your copyrighted materials as their own, you can take five steps to fix the situation.
#1. Cease and desist letter. This is the first step to take, and Bailey says there are several places to find templates for such letters (see links below). The goal is to inform the site operator that they are using your materials inappropriately and request that they remove the content before facing further action.
- Cease-and-desist letters are usually sent by email to the site’s operator, but if you can find a physical address, also send a copy via certified mail to confirm that they’ve received it.
- Cease-and-desist letters usually only work with people who’ve simply made a mistake and failed to attribute the source of the content. Unrepentant plagiarists or spam bloggers are likely to be unmoved.
#2. Notice and takedown procedure. If you can’t get through to the site’s operator, contact the Web host. The Digital Millennium Copyright Act contains a provision that requires Web hosts to remove infringing content when given notice by the rightful copyright holder.
- Track down the site’s host by performing a whois search at sites such as Domaintools.com or Network Solutions.
- Once you’ve identified the hosting company, check their website for a DMCA contact. If it isn’t listed, you can find a list of DMCA contacts at http://www.copyright.gov.
- You must follow a specific format for a notice and takedown request (available online, see links below). Once they receive a valid notice, US Web hosts should remove the material within 72 hours.
- The European Union and many other countries, such as Australia, have similar policies, so if you find plagiarized material hosted by overseas companies, look for that country’s notice and takedown procedure.
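If you would rather script the whois step than use a Web form, the sketch below speaks the plain WHOIS protocol (a one-line query over TCP port 43). It is a rough illustration, not a replacement for the lookup sites named above: the function names are hypothetical, the default server is IANA's, which typically refers you on to the registry for the domain's extension, and the reply is free-form text that you still have to read for registrar and name-server details.

```python
import socket
from urllib.parse import urlparse


def host_from_url(url: str) -> str:
    """Pull the domain out of a page address, dropping a leading 'www.'."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def whois_query(domain: str, server: str = "whois.iana.org",
                timeout: float = 10.0) -> str:
    """Send one WHOIS query and return the raw text reply."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


print(host_from_url("http://www.example.com/post/1"))
# A live lookup needs a network connection:
# print(whois_query(host_from_url("http://www.example.com/post/1")))
```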
#3. Blocking search indexing. If the site is hosted in a country without a notice and takedown rule, such as China, ask the major search engines to remove that site from their rankings. Search engines are governed by a similar provision that requires them to stop indexing sites as soon as they receive notice that the site is using copyrighted material without permission. “The plagiarism will still be out there, but at least it’s not impacting your own search rankings, and no one else is going to find it anyway.”
#4. Removal from ad networks. Plagiarists who sell ads on their sites, such as spam bloggers, can be kicked out of advertising networks if you give notice that the site in question is infringing on your copyright. “You can get those sites closed down, and that really hurts because it hits them in the wallet.”
The major online ad networks have their own procedures for handling copyright infringement. Check the links below for Bailey’s list of DMCA contact names.
#5. Lawsuit. Despite the protections of copyright law, lawsuits against plagiarists are very rare. For starters, copyrighted material must be registered in order to bring a case in federal court, and Bailey says the vast majority of content online isn’t registered.
It’s also rare because you have to find a lawyer willing to take the case, which requires a particularly egregious case of plagiarism and a clear-cut example of financial damages suffered by the copyright holder. So while you may be tempted to get back at a plagiarist, the best course of action is usually to get your materials removed from the site in some way, then be vigilant for any future violations.
Useful links related to this article
Network Solutions Whois search:
DMCA contacts for a range of ad networks, blog networks, major Web hosts, search engines, and more: http://www.plagiarismtoday.com/dmca-contact-information/
U.S. Copyright office list of contacts to notify for cases of infringement
Sample cease and desist letter
Sample DMCA takedown notice: