Some thinking aloud about tagging.: antennapedia

Dec 12, 2011 15:48

Last month I was pondering taxonomies of fanfic and how to make fic searchable. I wrote up some notes for myself, which I'm dumping here in case anybody’s interested and wants to ramble back at me.

One approach is a million hard-coded attributes (think ff.net’s search or Ravelry’s search or Zappos’s search). Another approach, at the other end of the spectrum, is free-form tagging. The former makes for giant heavyweight search interfaces that allow you to narrow things down very precisely. The latter makes for nice browsing (click on tags) and allows authors to describe their fic contents more deeply. It also makes omissions in the original taxonomy much less painful.

Downside: it affords tags like “omg I am so high on sugar you guyz”, but that kind of tagging is also an effective warning about the contents, so perhaps a mixed blessing.

Story traits of interest:
fandom
character
pairing
gen/het/slash/femslash
author
rating
word count
warnings
crossover-ness
trope/genre
content tags
Possibly others, but those are the ones people keep mentioning as traits they want to search by.

Fairly long list. Some are inherent traits (author, word count), others need to be assigned by author or observer. You can implement all of these by tag if you want, because tags are infinitely flexible. (This is the path that the AO3 implementer began walking down.) You can implement them all by making each one an enumerated list that the poster picks from (ff.net style). You can mix & match.

I’d just always make crossoverness a checkbox. Can deduce it from multiple fandom tags, if you trust that, but it’s such a huge want/not-want in searches that I’d just give it the special trait status and be done. Ratings should similarly be completely constrained; pick from this enum only.
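A rough sketch of what I mean by "special trait status", in Python. The class names and the four rating values are placeholders I made up for illustration, not anybody's real rating scale. The point is only that rating comes from a closed enum and crossover is its own boolean, with the multiple-fandom deduction available as a fallback.

```python
from dataclasses import dataclass, field
from enum import Enum


class Rating(Enum):
    # Hypothetical scale; substitute whatever the archive settles on.
    GENERAL = "general"
    TEEN = "teen"
    MATURE = "mature"
    EXPLICIT = "explicit"


@dataclass
class Story:
    title: str
    rating: Rating                  # constrained: must be a Rating member
    crossover: bool = False         # the explicit checkbox
    fandoms: set[str] = field(default_factory=set)

    def looks_like_crossover(self) -> bool:
        """Fallback deduction: more than one fandom tag."""
        return len(self.fandoms) > 1


# Rating("teen") looks the value up in the enum; an unknown string raises ValueError.
story = Story("Example", Rating("teen"), crossover=True, fandoms={"btvs", "sherlock"})
```

Anything outside the enum is rejected at the door, which is the whole idea: no free-form ratings.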

Warnings: political quagmire. Not going to tackle it here.

Plumbing of the participants in any sexual encounters: perfect use for tags. Can also handle shifting slang and any emerging trends for, say, genderbending fic.

I admit to a bias against heavyweight data models like in the hard-coded attributes approach. Why structure everything so rigidly? Document schemas trap you in that schema; you can only proliferate fields and never trim. I love the idea of a simple text search box that Does The Right Thing with the input. Lots of work behind the scenes to make that happen, however. Google search is a giant natural language AI project. But, you know, I consider that to be an acceptable tradeoff: I sweat so my users don’t. Also note that fanfiction is a smaller search space than The Entirety of The Internet. However… sometimes your users want to tell you exactly what they’re looking for and have you find exactly that.

I was also thinking about “convention over configuration” as philosophy. Fandom uses : to structure tags. Roughly: category:tag-data. It uses ! as a decorator: cave!Buffy drunken!Giles woobie!character-name. Why not work with the conventions?

Well, first you have to pick conventions. And once you put your convention into your tags, you’re sort of nailing yourself down to a specific language. By which I mean: If your convention is fandom:btvs for naming fandoms, you’ve just chosen English. You can perhaps reduce the sting by going with single-letter conventions: f:btvs u:antennapedia c:giles t:alcohol t:wingfic t:"council of watchers". Now that’s something you could explain and make work in that glorious single search box.
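Here is a minimal sketch of how the single search box might pick that apart. The prefix meanings (f = fandom, u = author, c = character, t = content tag) are my reading of the example above, and the function name is invented for illustration. It leans on Python's shlex to handle the quoted multi-word tag.

```python
import shlex

# Assumed meanings for the single-letter prefixes.
PREFIXES = {"f": "fandom", "u": "author", "c": "character", "t": "tag"}


def parse_query(text: str) -> list[tuple[str, str]]:
    """Split a search-box string into (kind, value) pairs."""
    terms = []
    # shlex.split keeps t:"council of watchers" together as one token
    # and strips the quotes.
    for token in shlex.split(text):
        prefix, sep, value = token.partition(":")
        if sep and prefix in PREFIXES and value:
            terms.append((PREFIXES[prefix], value))
        else:
            # No recognized prefix: treat the whole token as a content tag.
            terms.append(("tag", token))
    return terms


query = parse_query('f:btvs u:antennapedia c:giles t:alcohol t:"council of watchers"')
```

One caveat: shlex treats apostrophes as quote characters, so input with an unbalanced one raises ValueError. A real search box would need to catch that or use a more forgiving tokenizer.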

I note that they also work for urls, which can be aliases for common searches. E.g., http://fic-o-rama.com/u:antennapedia/c:buffy/ can represent a search. There’s something unsatisfying about that.

I note that RFC 3986 reserves , as a delimiter in URIs. See also section 3.3, in which @ is listed as an allowed path character. # introduces the fragment, so it’s a no-go inside the path of a URI.

Let’s try this:

@antennapedia c:buffy #swords #council of watchers in a search form
http://fic-o-rama.com/search/@antennapedia,c:buffy,swords,council+of+watchers in a url
http://fic-o-rama.com/@antennapedia/tagged/c:buffy,swords,council+of+watchers poss variation
http://fic-o-rama.com/search/c:giles,alcohol

Oh ho. Searches are now stable things that you can hand around.
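To convince myself the round trip works, here is a sketch that turns a list of search terms into the comma-separated URL form and back again. The host is the made-up one from the examples above and the function names are hypothetical. It uses quote_plus from the standard library so spaces become + and a stray comma inside a term gets escaped instead of splitting the term.

```python
from urllib.parse import quote_plus, unquote_plus

BASE = "http://fic-o-rama.com/search/"  # example host from the post, not a real service


def terms_to_url(terms: list[str]) -> str:
    # Leave ":", "@" and "!" readable; everything else unsafe is percent-encoded.
    return BASE + ",".join(quote_plus(term, safe=":@!") for term in terms)


def url_to_terms(url: str) -> list[str]:
    if not url.startswith(BASE):
        raise ValueError("not a search url: " + url)
    # Split on the literal commas first, then decode each piece.
    return [unquote_plus(part) for part in url[len(BASE):].split(",") if part]


url = terms_to_url(["@antennapedia", "c:buffy", "swords", "council of watchers"])
```

Splitting before decoding is the design choice that matters here: an escaped comma inside a tag survives the trip.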

You can also write the heavyweight search form around those conventions, for people who prefer it. You have a search api. The text search box translator and the giant form of many tickyboxes both use that API.

Riffing some more on this… You can programmatically determine that these tags refer to the same character and sort them together:
c:giles
c:hurt!giles
c:vamp!giles
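A small sketch of that grouping, assuming the decorator always comes before the ! and the character name is whatever follows the last one. The function names are mine.

```python
from collections import defaultdict


def base_character(tag: str) -> str:
    """c:hurt!giles -> giles; c:giles -> giles."""
    name = tag.split(":", 1)[1] if ":" in tag else tag
    # Whatever follows the last "!" is taken to be the character.
    return name.rsplit("!", 1)[-1]


def group_by_character(tags: list[str]) -> dict[str, list[str]]:
    groups = defaultdict(list)
    for tag in tags:
        groups[base_character(tag)].append(tag)
    # Sort characters, and the variants under each character.
    return {name: sorted(variants) for name, variants in sorted(groups.items())}


cloud = group_by_character(["c:vamp!giles", "c:buffy", "c:giles", "c:hurt!giles", "c:cave!buffy"])
```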

Character tags can be deduced from pairing info trivially: giles/xander ⇒ c:giles c:xander
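"Trivially" looks something like this, assuming pairings are always written with / between names and no character name contains a slash. The function name is made up.

```python
def characters_from_pairing(pairing: str) -> list[str]:
    """giles/xander -> ['c:giles', 'c:xander']"""
    names = [part.strip().lower() for part in pairing.split("/")]
    # Skip empty pieces so a trailing slash doesn't produce a bare "c:" tag.
    return ["c:" + name for name in names if name]


tags = characters_from_pairing("giles/xander")
```

It handles threesomes and moresomes for free, since it just splits on every slash.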

I tried this out on my fic archive. See the results here. Note the sorting of the tag cloud. (No, I’m not happy with the display either. I think my use of Isotope there wasn’t a win.)

Don’t have any conventions for word count or warnings. Note that you can address that if the results are fast, easily sorted, and give word count + warning info in the capsule summaries. It starts to matter when you have hundreds of results you want to winnow through, though.

I have a corpus of 121 stories to which I have applied 159 tags. (On my archive. I have applied far fewer to the versions posted on AO3, because tagging is heavyweight there.) I suppose I need to find out how typical that is. How much overlap from author to author? Even if we assume none, which is vanishingly unlikely, that ratio (159/121, or about 1.3 distinct tags per story) means we’re only talking 1.3 million tags for a million stories, which is still in small data territory.

I would probably start by asking a group of fans with a library science background to come up with a list of a couple hundred canonical pan-fandom tags to seed the content tag list with. Auto-complete suggestions then coax users into tagging from the conventional set to start with. When in doubt, use one of those. Make it easy for the user to pick hurt/comfort instead of h-c or a zillion variants.
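The coaxing could be as simple as this sketch. The seed list and the alias table are invented stand-ins for whatever the library-science folks would actually produce. Known variants map straight to the canonical tag, and anything else gets prefix matches from the canonical list.

```python
# Hypothetical seed data, for illustration only.
CANONICAL = ["hurt/comfort", "wingfic", "alcohol", "humor"]
ALIASES = {"h-c": "hurt/comfort", "h/c": "hurt/comfort", "hurt-comfort": "hurt/comfort"}


def suggest(typed: str, limit: int = 5) -> list[str]:
    """Offer canonical tags for whatever the user has typed so far."""
    typed = typed.strip().lower()
    if not typed:
        return []
    suggestions = []
    # A known variant goes to the top of the list.
    if typed in ALIASES:
        suggestions.append(ALIASES[typed])
    for tag in CANONICAL:
        if tag.startswith(typed) and tag not in suggestions:
            suggestions.append(tag)
    return suggestions[:limit]


offered = suggest("h-c")
```

If nothing matches, the user's own tag goes through untouched. The canonical list is a nudge, not a gate.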

From there on out, you analyze the data for emerging trends. Oft-used tags migrate to the canonical list. Do as much automatically as possible. Sub-communities will develop their own idiosyncratic tagging habits; this is cool. You won’t be able to solve the problem of people misspelling things or the problem of “four million Johns in as many canons”. (Or you can solve it, but that way lies AO3 tag implementation madness, and we already know we don’t want that.) Just make people search on a fandom tag as well. If you make that cheap and easy to do, it won’t be a burden. f:sherlock john medical+school
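The "oft-used tags migrate" step might start out as a plain frequency count, as in this sketch. The threshold and the function name are arbitrary assumptions, and a real version would presumably also look at how many distinct authors use a tag.

```python
from collections import Counter


def tags_to_promote(story_tags: list[list[str]], canonical: set[str], min_stories: int = 3) -> list[str]:
    """Return non-canonical tags used on at least min_stories stories."""
    counts = Counter()
    for tags in story_tags:
        counts.update(set(tags))  # count each tag once per story
    return sorted(
        tag for tag, n in counts.items()
        if n >= min_stories and tag not in canonical
    )


stories = [
    ["wingfic", "tea"],
    ["tea", "swords"],
    ["tea", "wingfic", "tea"],
    ["wingfic", "swords"],
]
candidates = tags_to_promote(stories, canonical={"wingfic"})
```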

Note that I make no statements about storage. All this is about the user-facing aspect of the design. I would so do this as full-text search, though. Store text, search text. You can probably index a single user’s tag corpus for fast access. By which I mean, I’d do it. Can also easily offer completion for a user’s existing tags.
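For the completion part, a sorted list and a binary search go a long way at this scale. This is only a sketch under that assumption. The class name is made up, and it says nothing about how the tags are actually stored.

```python
import bisect


class UserTagIndex:
    """Prefix completion over one user's existing tags."""

    def __init__(self, tags: list[str]):
        # Lowercase and de-duplicate, then keep sorted for binary search.
        self._tags = sorted({tag.lower() for tag in tags})

    def complete(self, prefix: str, limit: int = 10) -> list[str]:
        prefix = prefix.lower()
        # First position whose tag sorts at or after the prefix.
        start = bisect.bisect_left(self._tags, prefix)
        matches = []
        for tag in self._tags[start:]:
            if not tag.startswith(prefix) or len(matches) >= limit:
                break
            matches.append(tag)
        return matches


index = UserTagIndex(["swords", "council of watchers", "alcohol", "wingfic", "Swords", "swearing"])
```

With a couple hundred tags per author, rebuilding this on every page load would probably be fine too.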

analysis, geek