Monday, May 30, 2011

Scrapers and scammers and big shots

I started the blog NEPA Solar ten days ago with a few things in mind. First, I wanted to be able to find relevant information about what's going on in the solar industry in Northeastern Pennsylvania, and I was frustrated at my inability to find a good, centralized source with that information or with links to other sites that had it.  So I decided to create one, or at least to create a site that provided links to the sort of information I was trying to find.

It seemed so easy.  I have some experience in the solar industry, from my days working at a now-defunct company called AstroPower in Newark, Delaware.  But quickly I realized I was in a little over my head. The best I could hope to do for a start was post links to solar distributors and suppliers that I found in three different Yellow Pages (along with the ones I already knew about), links to sites with information about solar power, and links to interesting stories about solar power with general relevance or relevance specifically to solar power in Northeastern Pennsylvania.  So I started to do the most basic of Google searches, for terms like "solar NEPA" or "solar northeastern pennsylvania" or even just "solar pennsylvania." And I started to find sites. A lot of sites.

Which brings us to the issue of scrapers.

As you may recall, a few weeks ago I came across a "scraper blog" while searching for blogs to add to NEPA Blogs. What this blog - now removed - did was republish entire blog posts from other blogs, but have both the blog title and blog author links point back to the scraper blog, creating the appearance that this material was actually written for the scraper blog in the first place.  Also, all ads were removed from the original posts, which were now heavily salted with ads that would pay money to the scraper blog. Stolen posts, misappropriated credit, misdirected advertising clicks - pretty low in terms of the blogosphere ecology.*

One of my friends and former co-workers is also very interested in the solar power industry, and he regularly posts links to interesting solar articles on his Facebook page.  I had read one of those articles one day, about a promising technology that combines photoelectric conversion of sunlight into electricity with a way of also making use of the heat of sunlight itself.  It was published in a respected cutting-edge online magazine and contained appropriate sources and credits.  But as I was searching for informational websites to add to NEPA Solar, I came across that same article several times, with the author's name replaced by the name of somebody associated with the site hosting the article and all sources and credits stripped out.  The article had been scraped.

I began to notice odd things about some of the sites I was visiting.  One had many interesting articles but used peculiar language: solar panels would be referred to as "gadgets," for example.  I realized this was similar to something I have seen in spam comments on blogs, where the same comment is posted with slight variations in language used: "I am the sort of person who is interested in new things" could become "I am the type of fellow who is intrigued by new gadgets" or "I am the kind of guy who is fascinated by the latest inventions", to quote an actual example. It's as if this statement was sent through a program that had multiple values for key words, and these values could be arranged randomly to create seemingly different statements.  It would not be a huge leap to use this to create a program that could copy a block of text and then use a sort of thesaurus function to change the words enough to escape immediate detection by someone hunting for plagiarists.
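To make that concrete, here is a minimal sketch of how such a comment "spinner" might work. This is my guess at the mechanism, not anyone's actual code: the TEMPLATE list and the all_variants function are names I made up for illustration, and the slots are filled in from the example quoted above.

```python
from itertools import product

# Hypothetical template: each slot holds interchangeable phrasings.
# A slot with a single entry is fixed text that never changes.
TEMPLATE = [
    ["I am the"],
    ["sort of person", "type of fellow", "kind of guy"],
    ["who is"],
    ["interested in", "intrigued by", "fascinated by"],
    ["new things", "new gadgets", "the latest inventions"],
]


def all_variants(template):
    """Return every sentence you can build by picking one option per slot."""
    return [" ".join(choice) for choice in product(*template)]


variants = all_variants(TEMPLATE)
print(len(variants))  # 3 * 3 * 3 = 27 different-looking comments
print(variants[0])
```

Three slots with three choices each already yield 27 superficially different comments, and all three versions quoted above are among them. A spammer would presumably pick one at random for each post rather than list them all.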

Meanwhile, other sites took a different approach: offer a large number of articles with only a small amount of information for each, but load up the site with ads.  In these cases, as with the scraper blogs, the most likely motive was to get advertising revenue by drawing traffic with minimal effort and presenting visitors with a plethora of ads. The sites were only nominally about solar power, or solar power in Northeastern Pennsylvania. They were, as far as information content goes, a scam designed to generate advertising revenue.

Which, in a sense, was the same as my second reason for starting NEPA Solar: to generate specific, solar-focused ads that visitors might be inclined to click on.

That's one of the problems of putting ads on a "life" blog like Another Monkey.  The ad programs don't know what to make of my writing.  One day I'm focused on writing, another on politics, another on gardening, another on stargazing.  What kind of blog is this? What sort of ads would work best? With Another Monkey it's a crap shoot day after day, ad-wise.  But with a site like NEPA Solar, the ads can be very specific to the solar industry.  And once I start drawing traffic to that site, some of those ads might generate some minuscule revenue.

Now, it turns out that the "scraper blog" that had been stealing content from numerous bloggers throughout Northeastern Pennsylvania had been around, apparently, for a long time.  Maybe five years.  Five years of quietly stealing content without anyone noticing.  That doesn't really suggest the blog resulted in a big advertising revenue stream. So why do it at all?

Well, the most likely answer seems to be: prestige.  If this individual could point to this blog and state (or, perhaps, imply) that all of these bloggers worked for him, toiling away generating content for him to post on his site, that might be impressive to someone, someone inclined to be impressed by that sort of thing. He creates the illusion that he is a big shot, and then convinces others that he is a big shot, and thereby becomes a big shot.

Which is the third reason behind NEPA Solar.  In that sense, the scraper blog inspired me to create a blog that might allow me to define myself as a big shot - the driving force behind Northeastern Pennsylvania's foremost (read: only) website focusing on solar power and how it relates to the region.  Yessir, you're interested in finding out more about solar power? Come to my site and you'll find all sorts of helpful information at no charge whatsoever. Oh, and while you're there, be sure to take a look at a few of our sponsors' ads.  Yes, indeedy, fine upstanding advertisers on our site... That must be worth something, somewhere, right? Perhaps even another line item on my résumé, or a discussion point during job interviews.

As I'm gathering information for this blog I'm learning things.  Things like the solar farm planned for Nesquehoning, and the collapse of Pennsylvania's SREC (Solar Renewable Energy Certificate) market due to an oversupply of solar energy on the market (state-specified targets have been met and exceeded).  Things like Pennsylvania HB1580, a piece of legislation intended to prop up the solar market by both raising the targets for solar energy and closing off Pennsylvania's borders to out-of-state SRECs, something that is already being done in many other states.**  Unfortunately, if this bill does not pass, Pennsylvania's fledgling solar industry may crash just as it is starting to stretch its wings - and NEPA Solar will actually be a documentation of the death of the solar industry in Northeastern Pennsylvania.

*Google Reader and similar services do sort-of kind-of the same thing, except clicking on the post title and/or author's name would take you to the original post. But by using Google Reader it is possible for someone to read every post from a blog without ever actually visiting that blog.

**I'm also discovering that someone seems to be poisoning solar information sites with viruses, including several sites hosting articles that are linked to directly by the website of the State Representative who authored HB1580.
