29 November 2011

What's The Difference?

To continue with the Triage project, I've spent a day or two with more graphics texts (about which I'll be musing anon), and getting more familiar with the mapping scenarios.

Separate from the scatterplot matrix data shown in Triage, which would be used to measure the micro components of a campaign, is the question of displaying the national trend, twixt Us-uns and Them-uns. For that one turns to map graphics, which is a whole other world. Still in R, mind, but not statistical in nature.

What I have recently found is this site, which replicates a US map with 2004 election results. Now, our Apparatchiks won't be downloading zip files from outside sources, of course. On the other hand, the files make for a perfect diving board for the PoC. Load them into PG, swapping Republican for Bush, Democrat for Kerry, and Other for Nader (that's not much of a stretch!). Just for completeness: much earlier I'd found (but can't find that I'd cited) this map exercise, but as of now, the author has been too embarrassed to post the R that does it. While it shows only some form of income data (not specified), it is a follow-on to an election stream map set (which it links to), and that set is also not supplied with the R that made it. Nevertheless, one can conclude that, with enough time, this is a task suited to R. As mentioned in an earlier post, the animation bits are likely via googleVis.

I'll be using his data, since it provides a basis and I don't have to concoct some, though not the R he used (still using the stock R from Wickham). It's not clear how the numbers were derived.

What is really useful about the 2004 map posting is the data source: a county level count. Get these into a PG table, and we have a surrogate for data which our Apparatchiks would have, and which we can further expand with relatively simple SQL, just to see how a map would change. The notion for this part of the Triage effort is to measure the effect of national campaign spending, post some event/ad/debate/foo, at the POTUS/party level; an RNC/DNC (or 501/527/foo group) view of the country.

Here's the new PG table we load into:

CREATE TABLE public.election (
state varchar(25) NULL,
county varchar(25) NULL,
tot_precincts int4 NULL,
precincts_reporting int4 NULL,
republican int4 NULL,
democrat int4 NULL,
other int4 NULL,
constraint pk_election unique(state, county)
)
WITHOUT OIDS
TABLESPACE pg_default


And we get it loaded thus (concatenated from the state/county files in the zip):

copy public.election from '/databases/rawdata/2004election/output.txt' using delimiters ';' csv header

Note that column names are underscored, rather than camelCase, since PG folds unquoted identifiers to lower case: any name with Caps in it has to be double-quoted every time it is used. Yuck.
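A quick illustration of the folding rule, run against the election table above. This is only a sketch; the camelCase column in the last comment is hypothetical, and nothing in this table is actually named that way.

```sql
-- Unquoted identifiers are folded to lower case, so all three of these
-- refer to the same column, tot_precincts.
SELECT tot_precincts FROM public.election LIMIT 1;
SELECT TOT_PRECINCTS FROM public.election LIMIT 1;
SELECT Tot_Precincts FROM public.election LIMIT 1;

-- A double-quoted alias keeps its case, which is how a result column
-- can reach R as "Republican" rather than republican.
SELECT state, sum(republican) AS "Republican"
  FROM public.election
 GROUP BY state
 ORDER BY state;

-- Hypothetical: had the column been created as "totPrecincts", every
-- query would need the quotes, and this unquoted form would fail
-- because it is looked up as totprecincts:
-- SELECT totPrecincts FROM public.election;
```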

And here's the PG + PL/R. I've left it as is: comment/uncomment to generate each of the maps. As shown, it generates the difference map, which appears last. In each commented/uncommented pair, the first (commented out) line is for the two event maps, while the second is for the diff map:



CREATE OR REPLACE FUNCTION "public"."us_graph" () RETURNS text AS
$BODY$
X11(display=':5');
pdf('US_graph_diff.pdf');
library(maps)
library(plyr)
library(proto)
library(reshape)
library(grid)
library(ggplot2)
library(mapproj)
states <- map_data("state")
#elections <- pg.spi.exec ('select state, sum(republican) as "Republican", sum(democrat) as "Democrat" from election where event_number = 2 group by state order by state');
elections <- pg.spi.exec ('
  SELECT a.state,
         sum(a.republican - (SELECT b.republican
                               FROM election b
                              WHERE b.event_number = a.event_number - 1
                                AND a.state = b.state
                                AND a.county = b.county)) AS Republican
    FROM election a
   WHERE a.event_number = 2
   GROUP BY a.state
   ORDER BY a.state')
elections$state <- tolower(elections$state)
elections$republican <- elections$republican/10000
choro <- merge(states, elections, sort = FALSE, by.x = "region", by.y = "state")
choro <- choro[order(choro$order), ]
#p <- qplot(long, lat, data = choro, group = group, fill = Republican / Democrat, geom="polygon", asp=.6)
p <- qplot(long, lat, data = choro, group = group, fill = republican, geom="polygon", asp=.6, main = "Poll Shift", xlab = "", ylab = "")
p + labs(y = "", x = "")
p + opts(panel.grid.major=theme_blank(), panel.grid.minor=theme_blank(), panel.background=theme_blank(), axis.ticks=theme_blank())
p + scale_x_continuous("")
p + scale_y_continuous("") + coord_map()
p + opts(axis.text.x = theme_blank(),axis.text.y = theme_blank(), axis.title.x = theme_blank(), axis.title.y = theme_blank(), axis.tick.length = unit(0, "cm"), axis.ticks.margin = unit(0, "cm"))
p + scale_fill_gradient(limits = c(0, 90))
print(p)
dev.off();
print('done');
$BODY$
LANGUAGE 'plr'


All that spinach for the library calls got eliminated by making an .Rprofile in the postgres user's home directory with the following line:

.libPaths("/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")

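You could also call out the libraries explicitly; both ways work. The additional spinach is various directions to eliminate the lat/long grid on the maps. None work! Looking at the code again, they can't as written: each bare p + ... line builds a modified copy of the plot and throws it away, so print(p) still draws the plot exactly as qplot() made it. The result has to be assigned back (p <- p + ...) before any of it takes effect, and that goes for the coord_map() and scale_fill_gradient() lines too.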


Here's the Event 1 map:


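Now, let's update the table to include an event_number (easier than using a date, anyway) and an event_type. That way, we can generate maps in sequence, but also note what sort of event just/last happened. We could also generate map sequences for only certain sorts of events (the allowed types would be listed in a check constraint). The unique constraint on (state, county) has to grow to include event_number as well, or the second event's rows would be rejected as duplicates.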
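A minimal sketch of that table change, assuming the election table and pk_election constraint defined earlier. The 'baseline' label for the already-loaded rows, the list of allowed event types, and the ck_election_event_type name are all made up for illustration.

```sql
BEGIN;

-- Add the two new columns; existing rows get NULL in both.
ALTER TABLE public.election
    ADD COLUMN event_number int4,
    ADD COLUMN event_type   varchar(25);

-- Treat the rows already loaded from the 2004 files as Event 1.
-- 'baseline' is a hypothetical label, not something from the data.
UPDATE public.election
   SET event_number = 1,
       event_type   = 'baseline';

-- The original unique(state, county) constraint would reject a second
-- event for the same county, so widen it to include event_number.
ALTER TABLE public.election
    DROP CONSTRAINT pk_election;

ALTER TABLE public.election
    ADD CONSTRAINT pk_election UNIQUE (state, county, event_number);

-- Restrict event_type to a known list (hypothetical values).
ALTER TABLE public.election
    ADD CONSTRAINT ck_election_event_type
    CHECK (event_type IN ('baseline', 'ad', 'debate', 'foo'));

COMMIT;
```

Wrapping the steps in one transaction means a failure part way through leaves the table as it was.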

So, let's make some new data:

insert into election (select state, county, tot_precincts, precincts_reporting, republican * .8, democrat * 1.2, other, 2, 'foo' from election where event_number = 1);

We wouldn't get such dramatic shifts (modulo Swift Boats) in the real world, but this is PoC territory.


This yields a new Event 2 map:


I'm still grappling with my main wish list item: showing the changes in the colors. As it stands, each map takes the full gamut, leaving the legend to display the shifts, which it doesn't do all that well. Viewed another way, why not show the delta of polling strength (vote displays are a bit late, after all)? We can do that with a single map. How to get the data out of the election table? For that a correlated subquery is sufficient: for each Event 2 county row it looks up the same county's Event 1 count and subtracts it, and the outer query sums those differences by state. It's that big SQL statement in the function above.
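For comparison, the same per-state delta can be had from a self-join rather than a correlated subquery. This is only a sketch of an alternative, not what the function runs; the aliases cur and prev are made up for the example.

```sql
-- Per-state change in Republican count between Event 1 and Event 2,
-- written as a self-join on (state, county) with the previous event.
SELECT cur.state,
       sum(cur.republican - prev.republican) AS republican
  FROM election cur
  JOIN election prev
    ON prev.state        = cur.state
   AND prev.county       = cur.county
   AND prev.event_number = cur.event_number - 1
 WHERE cur.event_number = 2
 GROUP BY cur.state
 ORDER BY cur.state;
```

One difference to be aware of: a state with no matching Event 1 rows drops out of the join result entirely, where the subquery version would return it with a NULL.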


Here's what the delta map looks like:


What we see is the shift, in absolute, not relative, numbers. So Texas looks to be more Democrat from Event 1 to Event 2 just because it started with more votes; same with California.
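To map a relative rather than an absolute shift, the delta can be divided by the Event 1 total. A minimal sketch, assuming the event_number column described earlier; the output column name is made up for the example, and nullif() is there only to avoid dividing by zero for a state with no Event 1 count.

```sql
-- Percentage change in Republican count per state, Event 1 -> Event 2,
-- using conditional sums so each state is aggregated in a single pass.
SELECT state,
       100.0 * ( sum(CASE WHEN event_number = 2 THEN republican END)
               - sum(CASE WHEN event_number = 1 THEN republican END) )
             / nullif(sum(CASE WHEN event_number = 1 THEN republican END), 0)
         AS republican_pct_change
  FROM election
 WHERE event_number IN (1, 2)
 GROUP BY state
 ORDER BY state;
```

With the synthetic Event 2 data made earlier (every county's Republican count scaled by .8), this should come out at about -20 for every state, give or take integer rounding, so the resulting map would be one flat colour. Real polling data would not be so uniform.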

Getting rid of the lat/long grid is still a problem, but then, this is a free PoC. Cheap at half the price.
