The Importance of Archive.org

Today we’re going to talk about the importance of the per-website “Internet History” that Archive.org provides through the Wayback Machine, which lets you check the history of a domain before you purchase it.

According to Matt Cutts, former head of Google’s webspam team, a domain can carry penalties previously imposed by the search engine over to its new owner if the buyer does not perform thorough due diligence before the purchase. As he explains, one of the most comprehensive ways to check a domain’s history, and thus make sure it wasn’t used for spam, is the Internet Archive at Archive.org.

Archive.org lets you see what previous versions of the website you’re interested in looked like. In other words, it shows whether the website was previously used for legitimate purposes or was used for, or associated with, spam. For instance, Google.com has almost 8.5 million captures, which lets you see what it looked like either this morning or way back in April 1999, when it was still in beta.
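If you would rather script this lookup than click through the site, the Internet Archive offers an availability endpoint that returns the capture closest to a given date. The sketch below is a minimal illustration, not an official client: the function names are our own, and the response shape (an `archived_snapshots` object containing a `closest` entry) is an assumption based on the endpoint's public JSON output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(domain, timestamp):
    """Build a query asking for the capture closest to `timestamp` (YYYYMMDD)."""
    query = urlencode({"url": domain, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def fetch(url):
    """Download a URL and return its body as text (performs a network call)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url), or None if nothing is archived."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]
```

Calling `closest_snapshot(fetch(availability_url("google.com", "19990401")))` would then ask for the capture nearest to April 1, 1999. The network call is kept in its own function so that the URL building and parsing can be checked offline.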

The Wayback Machine, launched in 2001, collects and stores current versions of web pages so that they remain available when they become valuable later on. The tool’s mission is to support the Internet Archive (archive.org) in building a digital library of Internet history through snapshots of web pages taken at different points in time.

Because web pages change constantly, the Wayback Machine’s crawlers return to a page after they have crawled and stored it, and repeat the process at a later date. This allows for an in-depth analysis of how websites and their pages evolve over time.
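Those repeated crawls can also be listed programmatically. The following is a rough sketch that assumes the Wayback Machine's CDX search endpoint and its `url`, `output`, `from`, `to`, and `fl` parameters; the helper names are hypothetical, and the code assumes the JSON output begins with a header row. It parses a capture list and keeps only the captures where the content digest changed, which is one way to approximate "the page was edited here".

```python
import json
from urllib.parse import urlencode

# CDX search endpoint of the Wayback Machine (assumed for this sketch).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(page, start_year, end_year):
    """Build a query listing captures of `page` between two years."""
    params = {
        "url": page,
        "output": "json",
        "from": str(start_year),
        "to": str(end_year),
        "fl": "timestamp,statuscode,digest",  # fields to return
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_captures(response_text):
    """Turn the JSON response (header row + data rows) into a list of dicts."""
    rows = json.loads(response_text) if response_text.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def content_changes(captures):
    """Keep only captures whose digest differs from the previous capture."""
    changed = []
    previous_digest = None
    for capture in captures:
        if capture["digest"] != previous_digest:
            changed.append(capture)
            previous_digest = capture["digest"]
    return changed
```

Filtering on the digest keeps the list short for pages that are crawled often but rarely edited. Fetching the URL is left to whichever HTTP client you already use.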

Although the tool’s main goal is to make this information available to future generations of researchers, historians, and scholars, marketers and SEO professionals also use the data regularly and find it valuable in their decision-making.

Even when you’re not doing a full website analysis, a record of a site’s past changes can be valuable, for instance when you’re working on a project that involves changes in traffic over time. Archive.org can provide insight into the changes that took place on the website and may have affected the traffic in question.

It may also come in handy when comparing old content or promotions that ran in previous years. Best of all, this information is free to access and requires no prior request or notice.

How Does the Wayback Machine Work?

Using the Wayback Machine is a lot like browsing a live website, except that it serves a cached version of each page, including the full HTML. This makes it easier to identify technical or structural changes over time, such as:

  • On-page meta;
  • Internal linking;
  • Image usage;
  • Dynamic portions of the page.
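The first three items in the list above can be pulled out of a saved snapshot with Python's standard `html.parser` module. This is only a sketch: the `PageSummary` class and its attribute names are hypothetical, and it assumes you pass in the page's original markup together with the page's original URL (archived pages viewed in the browser have their links rewritten, which would skew the internal/external split).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageSummary(HTMLParser):
    """Collect title, meta tags, canonical URL, links, and images of a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc.lower()
        self.title = ""
        self.meta = {}            # e.g. {"description": ..., "robots": ...}
        self.canonical = None
        self.internal_links = []
        self.external_links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "a" and attrs.get("href"):
            self._record_link(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.page_url, attrs["src"]))

    def _record_link(self, href):
        # Resolve relative links, then skip non-web schemes such as mailto:.
        target = urljoin(self.page_url, href)
        parsed = urlparse(target)
        if parsed.scheme not in ("http", "https"):
            return
        if parsed.netloc.lower() == self.host:
            self.internal_links.append(target)
        else:
            self.external_links.append(target)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
```

Feeding two snapshots of the same page through `PageSummary` and comparing the resulting attributes gives a quick view of what changed between them. Dynamic portions of a page are not covered here, since they usually require rendering the page in a browser.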

Most interestingly, the Wayback Machine isn’t limited to the home page: it works with any web page whose address you enter. It should be noted, however, that the Wayback Machine’s database is still incomplete and works best with highly popular pages.

In the calendar of captures, note the color-coding of the dates:

  • Red means there was an error;
  • Green indicates a redirect happened;
  • Blue means there was a good cache of the page.
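The same three categories can be reproduced when you work with capture data outside the browser. The sketch below maps an HTTP status code to the colors described above; the function names are hypothetical, and treating every 4xx and 5xx code as "red" is an assumption of this example rather than a documented rule.

```python
from collections import Counter


def capture_color(status_code):
    """Map an HTTP status code to the color categories described above."""
    try:
        code = int(status_code)
    except (TypeError, ValueError):
        return "unknown"  # e.g. a capture record without a numeric status
    if 200 <= code < 300:
        return "blue"     # good cache of the page
    if 300 <= code < 400:
        return "green"    # a redirect happened
    if code >= 400:
        return "red"      # an error (assumption: 4xx and 5xx alike)
    return "unknown"


def color_summary(status_codes):
    """Count how many captures fall into each color category."""
    return Counter(capture_color(code) for code in status_codes)
```

A domain whose history is mostly green or red deserves a closer look before purchase, because long runs of redirects or errors suggest the site was repurposed or abandoned at some point.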

For larger sites, such as Google, you will find that the home page is cached multiple times per day, whereas smaller sites may be cached only a few times per year. A cached page from Archive.org loads in your browser much like any other website, except that it carries a header from Archive.org. You can look for changes in titles, descriptions, robots directives, canonicals, and JavaScript just as you would with a regular page, by opening its source code.
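To get a feel for how often a given site has been cached, you can count its captures per year. This is a small sketch with hypothetical function names; it assumes each capture is identified by a 14-digit timestamp string in `YYYYMMDDhhmmss` form, as used in Wayback Machine snapshot URLs.

```python
from collections import Counter


def captures_per_year(timestamps):
    """Count captures per calendar year from YYYYMMDDhhmmss strings."""
    counts = Counter(ts[:4] for ts in timestamps)
    return dict(sorted(counts.items()))


def busiest_day(timestamps):
    """Return (YYYYMMDD, count) for the day with the most captures."""
    if not timestamps:
        return None
    day, count = Counter(ts[:8] for ts in timestamps).most_common(1)[0]
    return day, count
```

Years with no captures at all simply do not appear in the result, which makes gaps in a domain's history easy to spot.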

Several Things to Look Out For

Abusive Repurposing or Redirects of the Website

More often than not, what marketers want to spot is abusive repurposing or redirecting of a website, which diminishes the value of an aged domain. This happens when someone has previously purchased a domain and redirected it to a completely irrelevant website in the hope that the domain’s backlinks would boost that website’s authority.

Not only is this practice abusive, but it has also been deemed ineffective by Google. Such redirects can have a meaningful impact on a brand’s online authority only when the two websites are closely related in subject matter.

Previously Defaced or Hacked Websites

Similarly, another thing to look out for is a website that has previously been defaced, or hacked and infected with malware. You can spot such cases by the way the website’s content was modified to link to other, completely irrelevant websites.

Websites That Have Been Part of PBNs

Another thing to avoid is a website that has been part of a Private Blog Network (PBN). Although PBNs were considered highly effective in the past, Google and other search engines have learned over the years to spot them and penalize their owners. One of the best pieces on the topic is Doug Cunnington’s post on why PBNs aren’t effective anymore.

Domains That Infringe Someone Else’s Trademark

Last but not least, you should also avoid websites or domain names that infringe someone else’s trademark. These are trickier to spot, which is why we recommend using a service like Odys that will do the due diligence for you.
