Stealing other people’s texts just got harder

Plagiarism has exploded in the age of Covid-19. As more people work from home and take classes via Zoom without direct personal supervision, the temptation to co-opt someone else’s work has grown exponentially, as have increasingly sophisticated ways of copying it.

Tricks such as replacing a letter like “o” with a similar-looking character from a non-Latin alphabet, or using “invisible” white-highlighted text to fool current plagiarism detection programs, have become commonplace.
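To make the first trick concrete, here is a minimal Python sketch of how swapped-in look-alike letters can be spotted. It is an illustration only and not CopyLeaks’ method: the function names are hypothetical, and it assumes the text is expected to be written in the Latin alphabet. It relies on the official Unicode character names from Python’s standard library.

```python
import unicodedata


def non_latin_letters(text):
    """Return (index, character, Unicode name) for every letter that is not Latin."""
    hits = []
    for i, ch in enumerate(text):
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            if not name.startswith("LATIN"):
                hits.append((i, ch, name))
    return hits


def looks_disguised(word):
    """Flag a word that mixes Latin letters with letters from another script."""
    letters = [ch for ch in word if ch.isalpha()]
    latin = [ch for ch in letters if unicodedata.name(ch, "").startswith("LATIN")]
    return 0 < len(latin) < len(letters)


# Hypothetical sample: the second "o" is CYRILLIC SMALL LETTER O (U+043E).
sample = "plagiarism detecti\u043en"

for index, ch, name in non_latin_letters(sample):
    print(index, repr(ch), name)
```

A word written entirely in another alphabet is not flagged by `looks_disguised`, since only a mix of scripts inside a single word is treated as suspicious here. The white-text trick is not covered, because detecting it depends on the formatting of the file rather than on its characters.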

The average rate of plagiarism rose from 26% before Covid to 45% after it in the Netherlands, from 37% to 49% in France and from 42% to 53% in India, according to a survey of 51,000 college and high school students conducted by anti-plagiarism software maker CopyLeaks.

The solution is no longer more of the same, where software checks a database for copied words and paragraphs, but the use of artificial intelligence (AI), which compares not only word for word but also “meaning for meaning,” explains Alon Yamin, CEO of CopyLeaks.
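The limits of word-for-word checking are easy to demonstrate. The Python sketch below is a generic baseline, not CopyLeaks’ algorithm: it compares two texts by the three-word sequences they share. The function names and sample sentences are invented for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences in the text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=3):
    """Jaccard overlap of the two texts' n-word sequences, from 0.0 to 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample sentences.
source = "The copied content drove the search rankings of the site down and hurt sales."
paraphrase = "Duplicate pages pushed the website lower in search results, which reduced revenue."

print(overlap(source, source))      # identical text
print(overlap(source, paraphrase))  # same idea, different words
```

The identical copy scores 1.0, while the paraphrase scores 0.0 even though it says much the same thing. That blind spot is what a “meaning for meaning” comparison is meant to close.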

The Israeli startup’s software is used by schools and organizations around the world including Macmillan Publishers, Stanford University, the BBC, Medium, the National Space Society, the United Nations, Cisco and Accenture, as well as by students, bloggers and journalists.

CopyLeaks’ extensive customer list not only shows how widely its software can be used, but also how pervasive the plagiarism problem has become.

Schools are perhaps the most important use case for anti-plagiarism tools, but publications and book publishers can also use CopyLeaks to ensure their authors have not, even accidentally, misused someone else’s work (journalists, for example, often paraphrase text from another article, which is acceptable provided they’ve made enough changes to make it their own; otherwise the publication could face legal action).

Content Abuse

Companies that develop corporate websites are another source of potential clients for companies like CopyLeaks. Here the question is reversed: has someone else copied your work?

It was through the latter that CopyLeaks co-founder and CTO Yehonatan Bitton found his calling in the anti-plagiarism field.

In 2013, Bitton was developing content for a family-owned website when he realized it was being copied by competing websites. The theft was frustrating, but even worse, these multiple sources of identical content drove the site’s search rankings down and negatively impacted sales.

Bitton searched for a software solution to detect such content abuse, but could not find any. He then brought the idea of building something that might solve his problem to Yamin, his then work colleague and fellow graduate of the IDF’s 8200 Signal Intelligence Unit.

Yamin had been instrumental in developing AI and machine learning algorithms for the Israeli Army’s intelligence agency; this technology became the basis of CopyLeaks.

CopyLeaks CEO Alon Yamin and CTO Yehonatan Bitton. Photo courtesy of CopyLeaks

Promoting authenticity

Approximately 70 million copyright infringement cases have been uncovered by CopyLeaks technology from 75 million pages scanned and 58 million documents compared.

CopyLeaks uses AI to understand an author’s “voice”. This goes beyond words, where automated tools can “play with the text, changing words and their order, making it easy to mask plagiarism,” Yamin tells ISRAEL21c.

“Even if not a single word is identical, we can tell if the meaning or the sentence structure are very similar.”

It’s not beyond the capabilities of human readers, “but we can do it in an automated way at very high volume.”

And in a growing number of languages: CopyLeaks currently supports over 100 languages, including Hebrew and Hindi.

CopyLeaks can help schools and publications prevent intentional or accidental copyright infringement, but it’s also a way to “authenticate yourself, to ensure you’ve paraphrased enough and that you’ve properly attributed all of your citations. Our goal is to promote authenticity,” says Yamin.

An example of a CopyLeaks report. Image courtesy of CopyLeaks

The interface shows side-by-side comparisons of the original text on the left and the marked-up text on the right, complete with links to the source from which it was taken. Reports can be downloaded as PDF.

“A CopyLeaks scan [for plagiarism] can take anywhere from a few seconds to a few minutes depending on factors like the size of the document or the number of results,” says Yamin.

On demand or always on

CopyLeaks can be used under a site license purchased by a school, institution or publication; by individual authors who pay based on the number of words and pages reviewed; or integrated into an existing LMS (learning management system).

The technology works with most of the leading LMS platforms, including Moodle, Blackboard, Canvas, Brightspace and Schoology, which together cover around 90% of academic institutions. The software can run on demand (upload a file and click scan) or run continuously in the background.

Pricing ranges from $10 per month for 1,200 pages (or 300,000 words) per year to $566 per month for 120,000 pages (or 30 million words) per year. Prices for large institutions are adjusted to their specific needs. There is also a free trial that allows users to scan around 10 pages per month.
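A quick calculation shows what those two price points imply. The Python sketch below uses only the figures quoted above. The plan labels “entry” and “top” are hypothetical, and it assumes the monthly price is billed for all 12 months of the year.

```python
# Figures quoted in the article; the labels "entry" and "top" are hypothetical.
plans = {
    "entry": {"usd_per_month": 10, "pages_per_year": 1_200, "words_per_year": 300_000},
    "top": {"usd_per_month": 566, "pages_per_year": 120_000, "words_per_year": 30_000_000},
}


def words_per_page(plan):
    """Words that count as one page under the plan's quota."""
    return plan["words_per_year"] / plan["pages_per_year"]


def usd_per_page(plan):
    """Effective price per page, assuming 12 monthly payments per year."""
    return plan["usd_per_month"] * 12 / plan["pages_per_year"]


for name, plan in plans.items():
    print(f"{name}: {words_per_page(plan):.0f} words/page, ${usd_per_page(plan):.4f} per page")
```

Both tiers work out to 250 words per page. The effective price falls from about 10 cents per page on the smallest plan to just under 6 cents on the largest.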

CopyLeaks supports 25 file types, including image files, where OCR (optical character recognition) algorithms detect the copied text. It can even scan computer code that programmers write as part of application development.

Customers can set how sensitive the software should be; there are six different levels. “Some customers are only interested in copy/paste type plagiarism. So the sensitivity will be very low. Others care about anything that could possibly be similar, so the level of sensitivity will be very high. You can play with it and see what results are relevant to you in your use case,” says Yamin.

CopyLeaks recently launched a new tool: grading written essays using AI.

“We ran a pilot project with the Ministry of Education in Israel. Compared to human raters, we were off by only one point out of 100. It’s very accurate and fast – we can do it in just five minutes. And it’s completely unbiased,” says Yamin.

A global problem

CopyLeaks isn’t the only plagiarism detection tool keeping writers on their toes. The 800-pound gorilla in the space is Turnitin, which was acquired by Advance Publications in 2019 for $1.7 billion.

Turnitin, in turn, has been busy acquiring smaller competitors, resulting in a David vs. Goliath-style showdown for CopyLeaks, which employs just 25 people across its two offices (Kiryat Shemona in Israel for R&D and Stamford, Connecticut for sales and marketing).

And while that is a long way from the nearly $2 billion paid for Turnitin, CopyLeaks just raised a $6 million Series A round, adding to the $1.8 million it received in 2018 from Connecticut Innovations (hence the headquarters in Stamford).

Yamin notes that CopyLeaks has more than 200,000 people using it every month and a few hundred other B2B (business-to-business) customers, such as publishers and schools.

What about the kind of essay mills often associated with college fraternities? Will CopyLeaks put these out of business?

If you’ve paid someone to write completely original content, that will be hard to detect, Yamin admits, but if the same student has previously submitted an essay he or she wrote independently, CopyLeaks can compare the “voice” to determine whether it is the same.

CopyLeaks has so far focused on text and images, but Yamin says scanning of other media is coming in the future, including copyrighted videos posted to file-sharing sites.

Is there any geography that’s particularly egregious when it comes to copyright infringement? Yamin says no. “It’s really a global problem. It happens everywhere.”

More information on CopyLeaks can be found here

How to recognize plagiarized texts

Software may be the best way to spot plagiarized texts, but the human eye can still catch some of the most egregious cases. Here are the key signs to watch for, according to CopyLeaks:

  • Incoherence in writing style or sudden changes in writing pattern.
  • Writing style varies from word to word or in different paragraphs.
  • Content that does not relate to the specified topic.
  • References or sources not recommended in class.
  • Drifts and shifts in the topic.
  • Inconsistent citation methods.
  • Variation of font and size between paragraphs.
  • Multiple references without citation.
  • No citations, but extended cited sources.
