Archiving

In the winter of 2019, I was reading EVERYWHERE ARCHIVES: Transgendering, Trans Asians, and the Internet, which used a YouTube video by “Zach” as one of its focal points. Ironically enough, the piece with “Archives” in its name referenced a YouTube video that was no longer up. I was curious, so I went to see if I could find it. Alas, I could find it nowhere: not archive.org, not r/datahoarder, not even sketchy YouTube dumps1. Online content is rapidly succumbing to link rot, and queer media is especially at risk.

The original citation in EVERYWHERE ARCHIVES: Transgendering, Trans Asians, and the Internet

Archives

Queer Content

Queer content is highly susceptible to link rot - the factors that generally conspire to hinder queer lives (employment discrimination, homelessness, education discrimination, and increased medical costs) also collude when it comes to paying the bills. And when someone has to make hard decisions, one might reasonably expect server costs to be one of the first things to go.

Archiving - preserving content so that we may revisit it later - is extremely important. Archives are collections of primary sources that, in this case, give us a portal to explore queer history. That tumblr discourse might not seem important, but you may well be watching the first stirrings of a larger change that will only unfold in the future. The best archives are built proactively, while the content is still hot - this is where YOU come in!

Submit content here to be archived!! Submit a home page to archive an entire website, or a specific page to archive just that page. If you’d like a link to the archived content, you can share your contact details.

Websites

Reddit

I’m archiving the following subreddits monthly with my fork of redditPostArchiver:

aaaaaaaarrrrro
aaaaaaacccccccce
actuallesbians
agender
ainbow
Androgynoushotties
aromantic
Asexual
asexuality
ask_transgender
askgaybros
AskLGBT
AskMtFHRT
asktransgender
autogynephilia
bdsm
BDSMcommunity
bigonewild
bisexual
bisexuality
Bisexy
broslikeus
butchlesbians
butchlesbianselfies
ButchSelfies
cartoon_gaiety
collegeboys
ComingOutSupport
crossdreaming
crossdressing
Crossdressing_support
demisexuality
detrans
Drag
DrWillPowers
dyke
dykesgonemild
dykesgonewild
egg_irl
ennnnnnnnnnnnbbbbbby
estrogel
femboy
feminineboys
FlexinLesbians
foreskin
ftm
FTMfemininity
FTMFitness
FTMMen
FtMPorn
ftmskype
gaaaaaaayyyyyyyyyyyy
gay
gayanime
gaybears
gayblogs
gaybros
gaybrosgonemild
GaybrosGoneWild
gayclub
Gaycouplesgonewild
GayDaddiesPics
gaygeek
gaygineers
gaymarriage
gaymers
GaymersGoneMild
gaymersgonewild
gaynsfw
gayreads
gayrights
Gays
gaysian
GaySoundsShitposts
gaysports
Gaytheists
gayyoungold
gbltcompsci
GenderCynical
genderfluid
genderqueer
glbt
GLBTChicago
GoneMildTrans
GoneWildCD
GoneWildTrans
guyskissing
happentobegay
honesttransgender
itgetsbetter
lesbian
LesbianActually
lesbiandaily
lesbians
Lesbients
lgbt
lgbt_cartoons
lgbtcirclejerk
LGBTdailyLIFE
LGBTeens
LGBTGoneWild
lgbtHavens
LGBTindia
LGBTnews
lgbtnospam
lgbtnyc
LGBTOlder
lgbtpdx
lgbtpoliticsblogs
LGBTQAdvancement
lgbtqteens
LGBTReddit
LGBTrees
LGBTreesGoneWild
lgbtsex
lgbtstudies
LGBTunes
LGBTVent
lolgrindr
lovegaymale
MaleFemme
Malekissers
malemodels
MaleUnderwear
manass
manlove
masculinegirls
MeetLGBT
meettransgirls
MtF
MtFHRT
MtFHRTsuppl
MTFSelfieTrain
mypartneristrans
naobinarie
Neutrois
NonBinary
NonBinaryTalk
noprop8
pansexual
PFLAG
Pinke
polyamory
q4q
QPOC
queer
QueerCinema
queercomics
queercore
QueerFashionAdvice
queerottawa
QueerTransmen
QueerYouth
radicalqueers
rhps
rupaulsdragrace
Samesexparents
SapphoAndHerFriend
scissoring
SRSsucks
suddenlybi
SuddenlyGay
SuddenlyTrans
TGDisc
Tgirls
thecloset
TopsAndBottoms
TotallyStraight
traaaaaaacccccccce
traaaaaaannnnnnnnnns
TraaButOnlyBees
traaNSFW
tranarchism
tranprotips
trans
trans_irl
TransAdoption
transadorable
TransBreastTimeline
TransBreastTimelines
TransClones
TransForTheMemories
transgamers
transgender
transgenderau
transgendercirclejerk
TransgenderScience
transgendersurgery
transgenderUK
TransHack
transhealth
TransLater
TransMLP
transpassing
TransphobiaProject
transpositive
TransSpace
TransSupport
transtimelines
transtrade
transvoice
TransyTalk
traps
truscum
twinks
UKLGBT
wlw_irl
worldLGBT


Submit more to be archived here - the holes left by my biases are clearly visible in the subreddits above, so please help me fill them!

I’m currently working on a way to share these, but the files are a bit too large to serve from this webserver (asktransgender alone is ~2.4 GB as text only, and much more with images and other media). For now, please contact me if you’d like a copy.

Software Backend

General Setup

[UML diagram of the archiving pipeline]

Inspired by gwern’s efforts archiving darknet markets, archives are created in a pipeline with three main parts:

  1. Scheduling: content to be archived is queued for download at regular intervals
  2. Request generation: generating and filtering requests to online content for archival
  3. Source diversification: splitting requests across multiple proxies

Scheduling

[TODO]
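
In the meantime, here’s a minimal sketch of how the “regular intervals” could be wired up with plain cron; the user and script paths are placeholders, not my actual setup:

# /etc/cron.d/archiving - hypothetical schedule with placeholder paths
# monthly subreddit dumps on the 1st at 03:00
0 3 1 * *  archiver  /home/archiver/bin/archive-subreddits.sh
# weekly website mirrors on Sundays at 04:00
0 4 * * 0  archiver  /home/archiver/bin/archive-websites.sh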

Request Generation

Requests are sorted by the scheduler and sent to various archival tools. Some, like wget, require a paternalistic hand and are filtered through mitmproxy to prevent unsavoury behaviour like downloading unnecessary pages or logging the scraper out of the website being scraped.

mitmproxy

mitmproxy is a tool that sits as a man-in-the-middle and lets us filter the requests a tool makes. Even though wget has its own options for accepting/rejecting links, mitmproxy lets us answer forbidden pages with specific HTTP status codes, and it can be scripted in Python instead of just regex.

For example, gayplants.noblogs.org has links that aren’t hyperlinked which means wget will skip over them. Linkifying these textual links is as easy as:

from mitmproxy import http
from bleach.linkifier import Linker

linker = Linker()

def response(flow: http.HTTPFlow) -> None:
    try:
        # we only want to modify html pages, not, say, pdfs
        if "text/html" in flow.response.headers["Content-Type"]:
            flow.response.content = str.encode(       # re-encode the text back to a bytes object
                linker.linkify(                       # use bleach's built-in linkify function
                    flow.response.content.decode()    # decode the bytes object into text
                )
            )
    except (KeyError, UnicodeDecodeError):  # no Content-Type header, or content that isn't text
        pass

which would be run with:

mitmdump -s extractLinks.py --listen-port=9090
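
The same hook mechanism covers the “specific HTTP status codes” case: requests that would log the scraper out (or otherwise do damage) can be answered locally and never reach the server. A sketch, with a made-up URL pattern - note that on mitmproxy 7+ the class is http.Response rather than http.HTTPResponse:

import re
from mitmproxy import http

# hypothetical pattern - adjust to whatever the target site uses
FORBIDDEN = re.compile(r"logout|action=delete")

def request(flow: http.HTTPFlow) -> None:
    if FORBIDDEN.search(flow.request.pretty_url):
        # short-circuit: answer with a 403 ourselves so wget records a
        # failure and the real logout endpoint is never hit
        flow.response = http.HTTPResponse.make(
            403, b"blocked by archiving proxy", {"Content-Type": "text/plain"}
        )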

redditPostArchiver

redditPostArchiver is written in Python and supports downloading subreddits to a local database file. Originally written by GitHub user pl77, it worked but needed updating to run on Python 3.8, plus some other work (bug fixing, error handling, probably some logging and a quiet mode). I’m working on that in my forked repository here. The last original commit by pl77 was in 2018, and they have other projects that keep taking precedence.

wget

I use the following command to download a website for archival:

wget --page-requisites --adjust-extension \
        --convert-links \
        --level inf \
        --recursive \
        --no-remove-listing \
        --restrict-file-names=windows \
        --no-parent \
        -w 1 \
        --warc-file=warc \
        --warc-max-size=1G \
        -o wget.log \
        -e use_proxy=yes \
        -e http_proxy=127.0.0.1:9090 \
        -e https_proxy=127.0.0.1:9090 \
        --no-check-certificate \
        $website

Breaking that down into readable chunks:

  - --recursive --level inf: follow links with no depth limit
  - --page-requisites: also grab the images, CSS, and scripts needed to render each page
  - --adjust-extension: add .html to pages saved without an extension
  - --convert-links: rewrite links so the mirror is browsable locally
  - --no-parent: never ascend above the starting directory
  - --no-remove-listing: keep directory listing files (relevant for FTP)
  - --restrict-file-names=windows: sanitise filenames so the mirror copies cleanly across filesystems
  - -w 1: wait a second between requests
  - --warc-file=warc --warc-max-size=1G: also write everything into WARC archives, split into ~1 GB chunks
  - -o wget.log: keep a full log
  - -e use_proxy=yes -e http_proxy=… -e https_proxy=…: route everything through mitmproxy on port 9090
  - --no-check-certificate: required because mitmproxy re-signs TLS certificates with its own CA

Misconfiguring your crawler is a recipe for disaster! As a particularly salient example, I need only point at the first website I tried to mirror, which resulted in a 22 GB log file:

.../websites/susans.org >>> du -sh ./*
45M	./warcfile.cdx
3.8G	./warcfile.warc.gz
172M	./wget.log
22G	./wget.rejection.log
19G	./www.susans.org

youtube-dl & youtube-comment-scraper

I archive channels with:

channelIDs=(0mTlVosk4bQ AiU-KZ_KADY) # array of youtube channel ids
for channelID in "${channelIDs[@]}"
do
	echo "executing script in directory" "$PWD" "downloading channel" "$channelID"
	youtube-dl --download-archive "archive.log" -i --add-metadata --all-subs --embed-subs --write-all-thumbnails --write-auto-sub --embed-thumbnail --write-annotations --write-info-json -f "$youtubedlformat" "https://www.youtube.com/channel/$channelID"
done

and individual videos and their associated comments with:

videoIDs=(0mTlVosk4bQ AiU-KZ_KADY) # array of youtube video ids
for videoID in "${videoIDs[@]}"
do
	echo "executing script in directory" "$PWD" "downloading video" "$videoID"
	youtube-dl --download-archive "archive.log" -i --add-metadata --all-subs --embed-subs --write-all-thumbnails --write-auto-sub --embed-thumbnail --write-annotations --write-info-json -f "$youtubedlformat" "https://www.youtube.com/watch?v=$videoID"
	videoDIR=$( find ./ -type d -name "*$videoID" )
	echo Attempting to write comments to "$videoDIR/comments.json"
	youtube-comment-scraper -o "$videoDIR/comments.json" -f json "$videoID"
	echo Done with $videoID
done

where this reddit post is the source for the format string:

youtubedlformat="(bestvideo[vcodec^=av01][height>=1080][fps>30]/bestvideo[vcodec=vp9.2][height>=1080][fps>30]/bestvideo[vcodec=vp9][height>=1080][fps>30]/bestvideo[vcodec^=av01][height>=1080]/bestvideo[vcodec=vp9.2][height>=1080]/bestvideo[vcodec=vp9][height>=1080]/bestvideo[height>=1080]/bestvideo[vcodec^=av01][height>=720][fps>30]/bestvideo[vcodec=vp9.2][height>=720][fps>30]/bestvideo[vcodec=vp9][height>=720][fps>30]/bestvideo[vcodec^=av01][height>=720]/bestvideo[vcodec=vp9.2][height>=720]/bestvideo[vcodec=vp9][height>=720]/bestvideo[height>=720]/bestvideo)+(bestaudio[acodec=opus]/bestaudio)/best"
# the same post also recommends appending: --merge-output-format mkv -o "$PWD/%(upload_date)s - %(title)s - %(id)s/%(upload_date)s - %(title)s - %(id)s.%(ext)s"

Source Diversification

I use proxychains to route requests through multiple VPNs to lower the chance that I get banned by a server I’m scraping. I use the random mode with a chain length of 1. That said, it’s important to wait a respectful amount of time between requests to be polite.
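
For reference, a minimal sketch of the relevant proxychains.conf settings - the [ProxyList] entries are placeholders and assume each VPN exposes a local SOCKS5 endpoint:

# proxychains.conf (excerpt) - strict_chain/dynamic_chain must be commented out
random_chain      # pick a random proxy from the list for every connection
chain_len = 1     # each chain is a single hop
proxy_dns         # resolve DNS through the proxy as well

[ProxyList]
# placeholder endpoints, one per VPN
socks5 127.0.0.1 1080
socks5 127.0.0.1 1081

The scrapers then just get prefixed with proxychains (or proxychains4 for proxychains-ng).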

Legacy Content

wpull

“Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler.” (from the wpull README)

wpull is cool! But I quickly realized it was a bit too unstable for regular use.

Recursive Website Archiving

I briefly used the following command to recursively archive mediawikis:

wpull https://www.susans.org/wiki/Main_Page \
    --warc-move susans-wiki \
    --warc-file susans-wiki --no-check-certificate \
    --no-robots --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --page-requisites-level 1 --recursive --level inf \
    --escaped-fragment --strip-session-id --sitemaps \
    --reject-regex "Template:|Skin:|Skins:|User:|Special:|User_talk:|index\.php|\/extensions\/|\/skins\/" \
    --accept-regex "\/wiki\/" \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --database susans-wiki.db \
    --output-file susans-wiki.log \
    -v --server-response

With Susan’s Place’s wiki as the given example.

Installation

I had quite the time installing wpull. The following finally worked:

sudo pacman -S openssl openssl-1.0 python-pyopenssl python2-pyopenssl pyenv

# start up pyenv, add to startup for the future
echo 'eval "$(pyenv init -)"' >> ~/.zshrc
eval "$(pyenv init -)"

# install new python version
sudo su
# the version stated as compatible with wpull (3.4.3) doesn't work:
# collections.abc.Generator was only introduced in python 3.5, so install 3.5.9 instead
CONFIGURE_OPTS="--without-ensurepip" \
CFLAGS=-I/usr/include/openssl-1.0 \
LDFLAGS=-L/usr/lib64/openssl-1.0 \
pyenv install 3.5.9
exit # exit su :eyes:

# check things are working
pyenv shell 3.5.9
python --version #> Python 3.5.9

# download wpull's dependencies
wget https://raw.githubusercontent.com/ArchiveTeam/wpull/develop/requirements.txt    

which (at the time) output:

chardet>=2.0.1,<=2.3
dnspython3==1.12
html5lib>=0.999,<1.0
lxml>=3.1.0,<=3.5
namedlist>=1.3,<=1.7
psutil>=2.0,<=4.2
sqlalchemy>=0.9,<=1.0.13
tornado>=3.2.2,<5.0
typing>=3.5,<=3.5.1
yapsy==1.11.223

Then the dependencies and wpull itself can be installed:

pip3 install -r requirements.txt
pip3 install html5lib==0.9999999  # wpull needs the pre-1.0 html5lib API
pip3 install wpull

# reset shell to default python
pyenv shell system

and then wpull can be run with

PYENV_VERSION=3.5.9 pyenv exec wpull

I set an alias to make things easy and make my script compatible with other systems:

echo 'alias wpull="PYENV_VERSION=3.5.9 pyenv exec wpull"' >> ~/.bashrc

  1. Please let me know if you find it!! ↩︎