Content Scraping, Part 1: High-Tech Plagiarism

Plagiarism has existed since the days of feather pens and parchment. Whenever one person wrote something worthwhile, there was always someone else who wanted to steal that writing and pass it off as their own. Today, plagiarism is much more high-tech: the Internet has created a “wild west” environment in which content is often perceived as fair game and up for grabs.

Content scraping is the practice of stealing content from a reputable website and posting it on another website without the content owner’s permission. Most content scraping is performed by sophisticated software or programming – also known as bots – but it can be done manually with a simple “copy and paste.”

Why does content scraping occur?

Just as a high school student might copy information found in the encyclopedia to get an “A” on a report, a content scraper will steal from a legitimate website to improve the search ranking of their own not-so-legitimate website, taking advantage of the original site’s quality content and keywords.

Search rankings are increasingly driven by fresh, high-quality content, and studies consistently show that creating original content is the biggest challenge for content marketers. The constant demand for marketing content has spawned a cottage industry of so-called content developers, many of whom charge a few cents per word. Similarly, many web developers will include content services in their fee. These outfits can offer such low prices because they brazenly – and often illegally – scrape content from established websites.

What qualifies as content scraping and what doesn’t?

Copying an entire piece of content from someone else’s website, copyrighted or not, and putting it on your website clearly qualifies as content scraping. However, if someone uses certain facts from your content, there isn’t much you can do. You can protect content, but not pure facts. Even if you forbid this practice in your online Terms of Use, many courts won’t enforce these terms because the vast majority of website visitors don’t read or agree to them.

Content syndication is not content scraping. When content is syndicated, one publisher has granted permission to other sites to publish or reprint their content, usually a blog, using an RSS feed or other software. This permission is given in order to expand the author’s exposure and always includes proper credit and links to the original content.

One gray area is the use of attributed excerpts. It has long been common practice to use a paragraph or two from a published article as long as you include the publication’s name, the author’s name and a link to the original content.

However, many publishers are concerned that this practice is now hurting their readership and profits because website visitors don’t always click through to the original article. Several pending court cases address where to draw the line between a legal excerpt and illegal copying – or content scraping.

The rules and definitions of various forms of content scraping are evolving as we speak. However, there is clearly a movement to crack down on the practice, and search engines like Google are penalizing websites for duplicate content.

In Part 2, we’ll discuss the legal repercussions related to content scraping and what you can do to prevent it.


Cheryl Cooper, Esq.
