jenett | Copyright, content scraping, and related mysteries of the universe

Across my DW circle today, there have been lots of comments about a site - FriendBlab - scraping content and making it available. (However, I haven't seen comment about it on LiveJournal, where the same issues apply, so hi.) It looks like they scraped roughly 6000 journals each from LiveJournal, Dreamwidth, and InsaneJournal, along with a bunch of other sites. (elf has stats).

Before we go any further: a) lots of people have made DMCA requests b) the site is currently down (as best one can tell, removed by their hosting service, GoDaddy), and c) aware people are aware.

However, it's brought up a certain range of usual confusions about some topics, so let me put on my "Jenett attempts to explain" hat, and see if I can help. (please see #4, below, for some comment moderation notes)

Within:
1) Content scraping, public data being public, and how those two things fit together.

2) DMCA process, how it works, and what that "10 days" you see in the GoDaddy responses means. (Includes "only the copyright holder can file a DMCA takedown notice")

3) What you can expect from sites when you report this kind of thing and other useful stuff to know.

4) My background on some of this stuff, for people who don't know me.

dingsi has a good roundup of relevant links and sample responses.

1) Content scraping, public data, and the way those two combine.
Basically, what happened here is someone said "Self! I think it would be a good idea if there was a site that aggregated public data from people's social networking/journal site profiles, other public data like their RSS feeds, and so on, and combined it!" using FOAF - friend of a friend - links.

This is not a particularly wise idea, as it turns out.

What’s public?
Dreamwidth, LiveJournal, InsaneJournal, and other sites of that kind make certain kinds of data public. This includes public posts and profile information that can't otherwise be hidden - for example, both DW and LJ do not allow you to fully hide who your friend/circle/subscribed connections are, because those things are partly under your control, and partly under other people's control.

Also, public posts are public. RSS feeds of public posts are also public: all LJs and DWs have them, but how much content they contain depends on a) how many public posts you make and b) some of your settings. (Dreamwidth instructions over here.) Dreamwidth has a full list of what’s visible to anyone. (I believe that there are differences now from what’s visible on LiveJournal, but I am not digging for that right now as I’d like to get this posted sooner rather than later.)

You can do things that minimise your visibility - both LJ and DW have a setting that discourages search bots from finding your journal. (Discourages, not blocks entirely: people can write search bots that ignore this very easily.) And both have ways to limit, say, the info in your RSS feed to title only.
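To make the "discourages, not blocks" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the opt-out is expressed as a robots.txt rule, which is an assumption for illustration (a site may use page-level meta tags instead). The robots.txt text, the `ExampleBot` name, the URL, and the helper `polite_bot_may_fetch` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a journal site might serve for a user who opted out.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def polite_bot_may_fetch(robots_txt, url, agent="ExampleBot"):
    """Return what a well-behaved crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "https://example.org/some-journal/"

# A polite bot checks first and stays away.
print(polite_bot_may_fetch(ROBOTS_TXT, url))  # False

# Nothing enforces that check: a rude bot simply never calls it
# and requests the page anyway. The rule is a request, not a lock.
```

The check lives entirely in the crawler's own code, which is why the setting only helps against bots that choose to honour it.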

In this case, it appears that even for people who had that selected, the search bot was looking not just at "Hey, what profiles can I find on [site]" but also "Who are these people connected to?" (which is, again, public data - note that even if, say, on Dreamwidth, you hide who has access to or subscribes to your journal, that information is still public access on theirs.)
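A rough sketch of how that kind of discovery can work, assuming the bot simply walks public connection lists breadth-first. The journal names, the `PUBLIC_LINKS` data, and the `discover` function are made up for illustration; nothing here is FriendBlab's actual code.

```python
from collections import deque

# Hypothetical public connection lists: journal -> journals it publicly lists.
PUBLIC_LINKS = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["dana"],
    "dana": [],  # dana hides her own lists and opted out of search bots
}


def discover(start, links):
    """Breadth-first walk over public connection lists, returning journals in discovery order."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in links.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


print(discover("alice", PUBLIC_LINKS))  # ['alice', 'bob', 'carol', 'dana']
```

In this toy data, "dana" is still found because "carol" lists her publicly, even though "dana" exposes nothing herself.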

Non-public stuff
I have seen a few people report that the feeds appear to have contained locked posts, but I haven’t seen anything that’s explicit enough for me to be sure. (I’d love additional data, if you’ve seen it/experienced it, but with the site down, it’s hard to check. When I checked my own journal, the only RSS feed entries were public: I make only a few posts public, and most are some degree of access-restricted.)

My suspicion is that one of two things is happening there:
a) people are really talking about appearing at all (i.e. that they were found through friend-of-a-friend links, even though they had search bot searching disabled on their account.) (The specific comments I’ve seen as of this writing have been unclear about what exactly they consider private data.)

or b) some of those posts may have been posts that were originally public, and rapidly edited to be private. The RSS feeds *should* respond to this quickly, but it’s possible that going that way (rather than posting it originally locked) may lead to this kind of glitch. Again, the relevant site help files can help you set a default posting security level. (Mine default to access-locked: I have to go in and manually edit it to make a post public.) Dreamwidth’s info about this, for the curious.

It is also possible that if you edit posts to make them private or access-locked, they have already been cached/stored by various sites around the ‘Net. The most secure solution is to set a default level of protection.

2) DMCA process and copyright compliance on the Internet
First off, the DMCA is the Digital Millennium Copyright Act, which governs how Internet sites based in the US handle copyright complaints. It is an amazingly flawed bit of law, but it is what we’ve got. (I am not going into the ways I consider it amazingly flawed, because we’d be here for quite a while.)

There are three basic principles you need to know:
a) Only the copyright holder (or their legally authorised representative, which means ‘lawyer’, not ‘friend’, ‘spouse’, ‘parent’, ‘helpful random person’) can actually file a DMCA claim.

b) There is a specific process the hosting site has to follow. If they follow that process, they are exempt from legal prosecution. It is a rather stupid process. More on this in a minute.

c) If the process does not work, there is not much you can do without going to court (which given the question of jurisdiction, is complicated.) And the process is under penalty of perjury.

This means several things:
a) Dreamwidth (or whatever other site you *put* the content on) cannot take action for you. They are not the copyright holder, and they won’t let you authorise them to be so, because that is not their job. (This is a good thing, really, because I don’t know about you, but I want them running the site they’ve designed not doing other things.)

b) The process has to be followed specifically.

There’s a big list of things you need to include for a proper DMCA takedown notice. Dreamwidth’s DMCA information has the information you need to make a report to them, but it’s also one of the more readable versions of “information you need to make a report, period” once you remove their specific contact info - specifically all of the numbered points must be met.

If you send a report that doesn’t include all the required info, the hosting site is not required to do anything about it. (If they’re nice, they’ll send you a note saying it’s incomplete. Most sites are not that nice/competent/useful. There are a lot of really badly handled DMCA compliance contacts.)
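If you like checklists, here is a small sketch of a completeness check you could run on a draft notice before sending it. The field names are hypothetical labels of mine, paraphrasing the elements usually listed for a takedown notice (signature, the work, where the infringing copy is, contact info, a good-faith statement, and the accuracy/authority statement made under penalty of perjury). Check them against the actual numbered list on the site you are reporting to; this is not legal advice.

```python
# Hypothetical field names paraphrasing the usual required elements.
REQUIRED_FIELDS = (
    "signature",
    "copyrighted_work",
    "infringing_material_location",
    "contact_info",
    "good_faith_statement",
    "accuracy_and_authority_statement",  # the part made under penalty of perjury
)


def missing_fields(notice):
    """Return the required fields that are absent or blank in a draft notice."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(notice.get(field, "")).strip()
    ]


draft = {
    "signature": "J. Example",
    "copyrighted_work": "My journal entry titled 'Example'",
    "infringing_material_location": "https://example.org/copied-entry",
    "contact_info": "",
}

print(missing_fields(draft))
# ['contact_info', 'good_faith_statement', 'accuracy_and_authority_statement']
```

An empty list means every element is at least present; it says nothing about whether the statements in it are true.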

Note in particular that the statements are made under penalty of perjury: if you lie in them, that’s a criminal offense. People try to do that, and more often than you’d think (above and beyond the far more reasonable situations where there’s some legitimate confusion, or people trying to be helpful and reporting something they know is a violation.)

c) The process goes like this:
- Site gets properly formed complaint.
- Site says to person who posted the stuff in the complaint “Hey, we got this complaint.”
- The person who has received the complaint can say “I did wrong”, “I think this is a legit use, but I’m not going to fight removing the content” or “I think this is a legit use, and I want to counter notify of my intentions.” The last one has further legal obligations.
- The site takes the necessary action, and (one hopes) informs the complainant of the outcome.

Generally, people have 24-48 hours to respond to the initial complaint (sites are required to take action ‘promptly’ but no one defines what that really means, so there’s often a little flex over holidays, etc). If the poster of the alleged infringement counterfiles, the material is disabled for 10 days to give the complainant time to respond, but then re-enabled unless the complainant says “I do intend to take this to court.” (See also: stupid legal process).

This is why you’ll sometimes see stuff disappear, then go back up again at the end of the 10 days.
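The timing can be sketched as a tiny date calculation. This assumes the 10-day figure described above and treats it as plain calendar days, which is a simplification on my part; the function name `restore_date` and the example dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

WAIT_DAYS = 10  # figure used in this post; treated as calendar days (assumption)


def restore_date(counter_notice_received: date,
                 complainant_going_to_court: bool) -> Optional[date]:
    """Return when disabled material would come back, or None if it stays down."""
    if complainant_going_to_court:
        # The complainant has said they intend to take it to court,
        # so the material is not re-enabled by this process.
        return None
    return counter_notice_received + timedelta(days=WAIT_DAYS)


print(restore_date(date(2025, 1, 1), False))  # 2025-01-11
print(restore_date(date(2025, 1, 1), True))   # None
```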

See? I said it was complicated.

If someone has multiple offenses, they may have their account yanked (but the law doesn’t say how many is ‘multiple’ and sites don’t generally post their internal limits to avoid rules-lawyering).

And while LiveJournal and Dreamwidth (and I’m sure a number of other social sites) have a mechanism whereby site admins can disable a single infringing post, if there’s an infringement, say, on a privately hosted website/WordPress blog/whatever, chances are the hosting service is just going to disable the entire account, not just that one post. (As in this case)

3) What you can expect from sites when you report this kind of thing and other useful stuff to know.

Sites can only control what’s on their servers
In other words, while Dreamwidth can inform people about privacy options, RSS feed limiting tools, etc. they cannot file a complaint for you, make another site take action, or anything else along those lines.

Sites need to follow the legal process for their own protection
Because while it’s a stupidly designed law, the penalties for sites failing to follow the legal process are huge. You do not want a site you use to be stupid about this. Really.

Many many sites (including most of the huge ones) are really rather lousy at dealing with complaints
DW, LJ, and other smaller sites (by which we mean thousands to hundreds of thousands of active accounts, not millions) tend to be quite responsive. Really tiny sites (i.e. privately hosted forum/set of webpages/etc., any single-person project) vary between responsive (always appreciated), not responsive, nasty, and deeply confused, in about equal parts.

There’s a lot of confusion out there. The best thing you can do is a) hang out on sites that have a clue b) learn the legal requirements (or at least where to find them if you need them) and c) keep people in your networks who can help if your stuff shows up somewhere you don’t want. (Because there are those of us who remember and watch this kind of stuff for fun. Or at least amusement.)

If you create content (and you do! Posts like this are copyrighted the moment they have a fixed form on the screen), you should also a) consider how far you want to go to look for infringement and b) have a will valid in your current legal residence that says who you’d like your literary executor to be. (This latter is not directly relevant to the conversation at hand, but it’s worth saying whenever the chance comes up. Neil Gaiman explains why.)

4) My background on some of this stuff, for people who don't know me.
I spent about 18 months - from January 2003 to September 2004 - as a volunteer for the LiveJournal Abuse Team (also later called the LiveJournal Terms of Service Team).

denise (more recently one of the co-founders and co-owners of Dreamwidth) was the team manager then, and while I was there, I handled (as we all did) a number of DMCA requests from the site admin side.

We also got an *absurd* number of reports from people saying “So and so was mean to me on AIM/YahooChat/whatever other site, can you make them stop?” To which we’d generally look, sigh, and carefully explain that as much as it might be nice, we did not control the entire Internet. Alas. There are days it would have been much easier.

Since then, I have gone on to get my Master’s in Library and Information Science, and I work as an IT librarian for a small college library. While I no longer am the copyright compliance person anywhere, I do keep up with a range of bloggers, tech-folks, and geeks who are. I am a librarian, so questions are potentially answered (allowing for my time, energy, and the part where I am a librarian, not a lawyer, and especially not an intellectual property lawyer specialising in online settings and copyright.)

If you’re interested in another librarian who posts some awesome stuff about this, I recommend Nancy Simms, copyright librarian at the University of Minnesota. While she’s focused on higher education copyright issues (that being her job), she’s also very aware of other kinds of Internet use, and the implications of social sites, etc, and writes about them among her other topics. She also is actually a lawyer.

Finally: this is a public post, but I am a Jenett of wildly ranging energies and mental spoons some days. I’m glad to have comments, but a) I may not reply immediately (or necessarily at all, if I don’t feel it needs comment), and b) I reserve the right to moderate as needed to keep it a productive space for conversation. (That doesn't mean “everyone agree”, that means “conversation can keep going forward about the topic at hand”.)

jesse_the_k

This is an awesome post. Thank you very much. (And no reply needed)

dharma_slut

Thank you so much!

I admit that I didn't think to check on what I'm letting into my feed, but public is public and feeds are feeds.

(They didn't exist back in the day, it was so exciting when they started existing!)

silveradept

Thank you for the information on what's going on, and how things are supposed to be done.

I think it seems a bit sleazy to most people that someone can scrape with a bot and repost content to wherever and there's no mechanism in place to tell them to go away that they have to listen to. It weirds people out when that happens.

jenett

You're definitely right that it's sleazy. (And scraping profiles is particularly so.)

Personally, I'm not entirely sure where the copyright violations or other legality would come down on this: RSS feeds are public and meant to be shared (that's part of the protocol for them, after all.)

However, I think there's a point in the thread here on Elf's journal that while the RSS content isn't actionable, user icons and other unique material (bio, she makes a point about specifically chosen/worded interests) likely is. But I also don't know of a precedent setting case that comes anywhere near it.

On the other hand, one of the things LJ Abuse work taught me is exactly how inventive people can be using technology in a way you didn't expect. Also how *little* awareness of history people have, in terms of technology getting used a particular way, even people working in highly similar spheres. (I see this a lot with Facebook: some new issue gets brought up, and it's something LJ was dealing with in 2003, 2004, 2005, and should not be news for people working in the same general industry.)

dingsi

Ooh shiny and helpful! Included in the linkspam. :)

lapin-agile.livejournal.com

So. I came to this late enough that the site is down and I can't check to see if any of my blogs was scraped, except by doing a google search for friendblab.com + [name of blog]. (It appears that lapin_agile is on some other friends lists, but may not have been scraped itself. And the alternity blogs do not appear to be there, though the game pops up in your post about squid on the mantelpiece.)

There are lots of levels at which this is annoying, but not least is the fact that if one chose to have an account at friendblab.com, one could login (according to the google-able user info for the site) and, presumably, delete or alter posts in one's account. But if they've scraped from an account on an elsewhere-blog and plonked that content up on their site, the creator of that content has no way of logging in and deleting because... no user account.

Surely that's a version of identity theft? And... I don't even know what category that belongs in, having one's stuff put without permission on a site, but not having the options/rights of having an account with the site.

jenett

It looks like they only scraped about 6000 journals from DW, and about 6000 from LJ - which, proportionately, is not many. (I'm betting I got snagged because I'm on the access lists of several hub people, including D.)

And I agree, it's totally sleazy and lousy. I'm just not sure what the legal implications are.

I think, roughly:
- Copyright is actionable for user icons people have created from unique materials. (F'ex, either I or my ex-husband would hold the copyright for my default harp icon: it's a photo manip of a photo I took, he fiddled with using his software, and I did the cleanup work for.)

- The rest of it, I don't even know. It's not exactly identity theft, because they're not actually pretending to be the profiles they've scraped. But it's dishonest, and it's sleazy, and it's a lousy business practice besides.

The Department of Justice website about identity theft implies that it might be legally actionable if someone wanted to bring a case *if* there's also an attempt to commit a crime. Besides, y'know, delighting me in starting with a Shakespeare quote.

And here's a thing: when they scraped, they included posts (stripping cut tags and other site-specific warning codes) with NSFW/not-appropriate-for-minors material. And that might, in fact, be enough.

But it's an area I know much less about than the copyright piece.
