We are pleased to announce that we have begun preserving and providing access to crawls (snapshots) of the City’s website using Archive-It, a web application developed and managed by the Internet Archive. Archive-It uses an open-source crawler called Heritrix to crawl specific web content based on instructions provided by the user (in our case, that’s us), and the venerable Wayback Machine to provide access. Over time, the preserved crawls will show how the City’s website has changed in terms of content, look and feel.
How it works
Each crawl directs Heritrix to one or more “seed” URLs, which you can think of as the starting points of the crawl. From each seed, Heritrix browses through all links and saves any content it encounters that falls within the scoping rules for the crawl. Crawled content is saved in the WARC file format, an ISO standard for storing web content.
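For readers who like to see the idea in code, here is a minimal sketch of a seed-and-scope crawl. It is not how Heritrix is actually implemented: the tiny in-memory "web" (`TOY_WEB`), the `example` URLs, and the scoping rule (stay on the same host as a seed) are all assumptions made up for illustration, and nothing is fetched from the network or written to a WARC file.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "https://city.example/": ["https://city.example/parks", "https://other.example/"],
    "https://city.example/parks": ["https://city.example/", "https://city.example/parks/map"],
    "https://city.example/parks/map": [],
    "https://other.example/": ["https://other.example/about"],
    "https://other.example/about": [],
}


def in_scope(url, seeds):
    """Assumed scoping rule: keep only URLs on the same host as a seed."""
    seed_hosts = {urlparse(seed).netloc for seed in seeds}
    return urlparse(url).netloc in seed_hosts


def crawl(seeds, web):
    """Start at the seeds, follow links, and record every in-scope page once."""
    queue = deque(seeds)
    seen = set(seeds)
    captured = []
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # nothing to capture for an unknown page
        captured.append(url)  # a real crawler would save the content here
        for link in web[url]:
            if link not in seen and in_scope(link, seeds):
                seen.add(link)
                queue.append(link)
    return captured


if __name__ == "__main__":
    for page in crawl(["https://city.example/"], TOY_WEB):
        print(page)
```

Starting from the single seed `https://city.example/`, the sketch captures the three pages on that host and skips the link to `other.example`, because it falls outside the scoping rule.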