DigHumNotes: August 2011

Friday, August 19, 2011

Data Entry Redux

Earlier this week, I finally completed the last batch of data entry work for the current phase of this project. This involved entering information for major poets connected with the "Shi to shiron" poetry journal, several of whom stand as the most prolific in terms of total output between 1920 and 1944. I now have nearly 50 poets in my database, with a timestamp for each piece appearing in any of the journals to which they contributed. Owing to time and manpower constraints, I chose to input only the year and month for each poem or essay, and did not record titles except as uniquely assigned integers. At a later phase of the project, it may well be worth including things like title, volume number, and page, but only if the intention is to create an interactive resource where someone might want to call upon that information and have it accessible.

Having completed the data entry, I then began loading it into various visualization programs (primarily Cytoscape and Gephi) and experimenting with the possibilities for meaningful display of the data. This included creating individual time slices for the two-mode bipartite graph (poets x journals) so as to better chart the dynamic and shifting relationships in the network over time. I'm also working on creating time slices based on one-mode affiliation graphs generated from the bipartite data. These are graphs which attempt to represent the weighted connections between individual poets vis-a-vis their participation in the various journals. Thus, for instance, if Poet A contributed 8 pieces to Journal Z, and Poet B contributed 4 pieces to that journal during the same year, then we can say that they are linked through Journal Z by the smaller of their two contribution counts, i.e., 4. Were the two poets also to contribute to another journal that year, then the corresponding minimum for that journal would be added to 4 to produce the total weight of their connection. I am not yet finished with the time slices, but my hope is that they will create an insightful picture of how the poets involved in "Gakko" and "Shi to Shiron" were related to one another over time and how their career trajectories ebbed and flowed with the fluctuations of the cultural field.
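To make the weighting rule concrete, here is a minimal sketch in Python of how the one-mode weights could be computed for a single yearly slice. It is not the procedure Cytoscape or Gephi uses internally. The record layout (poet, journal, year), the function name `affiliation_weights`, the year 1929, and the second journal "Journal Y" are all assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: one (poet, journal, year) tuple per published piece.
# The year and "Journal Y" are invented for this example.
submissions = (
    [("Poet A", "Journal Z", 1929)] * 8
    + [("Poet B", "Journal Z", 1929)] * 4
    + [("Poet A", "Journal Y", 1929)] * 2
    + [("Poet B", "Journal Y", 1929)] * 3
)

def affiliation_weights(records, year):
    """Return {(poet1, poet2): weight} for one yearly time slice."""
    # counts[journal][poet] = number of pieces that poet placed in that journal
    counts = defaultdict(lambda: defaultdict(int))
    for poet, journal, record_year in records:
        if record_year == year:
            counts[journal][poet] += 1

    weights = defaultdict(int)
    for per_poet in counts.values():
        # Every pair of poets sharing a journal is linked by the smaller
        # of their two counts; the minima are summed across journals.
        for p, q in combinations(sorted(per_poet), 2):
            weights[(p, q)] += min(per_poet[p], per_poet[q])
    return dict(weights)

weights_1929 = affiliation_weights(submissions, 1929)
print(weights_1929)
```

With these invented counts, the pair is linked by 4 through Journal Z and by 2 through Journal Y, for a total weight of 6.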

In the course of doing this visualization work, which has often been done at considerable individual cost and questionable analytical benefit, a number of ideas have arisen as to how we might make the most effective use of the data we have and what fruitful avenues of exploration might lie ahead. One thing that has become apparent is the value of color-coding nodes based on different attributes. At present, my color scheme relies on dividing the poets according to their affiliation with either of the two journals being surveyed. Thus the "Gakko" poets are neon purple, and the "Shi to Shiron" poets are light blue. What was perhaps most amazing was that this delineation into two groups had an almost natural correspondence with the topology of the graph, which on its own pushed the two groups of poets to either side of a mitochondrial-like ellipse. Those poets with the most shared ties between the two groups gravitated toward the center, yet without daring to cross into the other's territory. Here is an image of the entire dataset, from the years 1922-1944, and a closeup of that same image. The graphs themselves aren't that meaningful given the large time span, but it is easy to get an idea of the kind of grouping I am trying to describe here. Given that the layout algorithms are working with edge weights to determine the relative repulsion and attraction of the nodes, it is perhaps not surprising that the two groups of poets would end up like this, the more prolific poets gathered at the center and the more minor ones pushed to the fringes of their respective blocks.

What I would like to try in the weeks ahead is to color the nodes according to different attributes (place of birth, education level, place of education, number of pieces submitted) and see how these do or do not line up with the initial color coding. What might also prove interesting is to remove some of the more dominant nodes and consider how the 2nd and 3rd tier poets (at least in terms of output) are connected to one another.

Time permitting, I would also like to experiment with these avenues of analysis, interpretation, and data enrichment:

1) Create a timeline of two-mode graphs to better show the lifespan of the journals over the entire period and to give a better sense of the total number of active poets in any one year.

2) Create affiliation graphs where the journals are the nodes, instead of the poets, and see what kinds of groupings emerge in any one period or over the entirety of the time span. This data could also be geocoded so as to give a sense of the relation between Tokyo, provincial, and colonial journals. We would have to be careful, however, to contextualize all results as a product of just the 50 or so poets involved, and therefore not reflective of the total number of contributors to journals other than "Gakko" and "Shi to shiron".

3) Scan images of all the journal covers and associate these images with the appropriate nodes using the Cytoscape custom graphics manager. This might be worth saving for a later date, but I could also just do a few samples so as to provide a sense of the possibilities of this project as a fully interactive database that does not completely erase the material specificity of the objects being represented (à la Manovich's argument).

Friday, August 12, 2011

Data Entry

I've spent two days re-entering data on my network of 25 poets, resulting in about 110 nodes and nearly 2,000 edges. The reason for this data-entry push has been to gain some granular clarity regarding time, so that now each submission to a poetry magazine is coded as an edge in the network. This allows me to insert things like title, year of publication, etc. as attributes of the edges, and will also allow me to create a more temporally dynamic view of poetic networks across the years 1920 to 1944. Time-consuming as it is to input this information (and I've only done the bare minimum thus far, leaving out the Japanese titles and the volume and page numbers), I think it will ultimately be much more meaningful to have a year-by-year snapshot of the poetic field. And it is still possible to load this edgelist into a program like UCINET and have it convert the duplicate edges (i.e., repeated contributions made to the same journal) into weighted edges upon which I can then run some statistical analyses.
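The conversion of duplicate edges into weighted edges amounts to counting repeated poet-journal pairs. The sketch below shows that idea in Python; it is not UCINET's own routine. The sample rows, the years in them, and the function name `collapse` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical edge list: one (poet, journal, year) row per submission.
edges = [
    ("Poet A", "Journal Z", 1929),
    ("Poet A", "Journal Z", 1929),
    ("Poet A", "Journal Z", 1930),
    ("Poet B", "Journal Z", 1929),
]

def collapse(edge_rows):
    """Count repeated poet-journal pairs, ignoring the year of each piece."""
    return Counter((poet, journal) for poet, journal, _year in edge_rows)

weighted = collapse(edges)
for (poet, journal), weight in sorted(weighted.items()):
    print(poet, journal, weight)
```

Note that the per-submission detail (here, the year) is lost once the rows are collapsed, so the original edge list is worth keeping alongside any weighted version.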

Next week's goal will be to expand the project by inputting information for a second group of poets who are predominantly known for their affiliation with the avant-garde journal "Shi to shiron" (『詩と詩論』), which ran from 1928 to 1933, and thus overlaps with the period in which "Gakko" was being published. The goal here will be to see how closely or loosely connected these two sets of poets are, given that they are traditionally separated by ideological and aesthetic orientation. To what extent do the poets who contributed to these two journals participate in the same magazines over the course of the 24-year period that concerns us? Do we see clear groupings that would confirm our historical sense of their affinities (or lack of affinity) with one another? Do they converge or diverge at different points in time? Do the poets participating in each journal exhibit different network typologies? My hope is that by looking at these two groups in particular, I might be able to show the ultimate benefit of inputting the modernist poetry database in its entirety.

Thursday, August 11, 2011

Visualizing Information (1)

In response to a recent query by participants of an NEH-sponsored Network Analysis Workshop, I've begun to do some preliminary thinking about the interpretive assumptions that underlie the visualization of social networks and the process of translating historical information into quantifiable social-network "data". In recent years, a number of scholars have begun writing about the push in the humanities to collate and visualize "data," a movement that potentially raises all kinds of ideological problems owing to the too easy adoption of seemingly objective methodologies developed in scientific and social-scientific disciplines. Notable among these scholars is Johanna Drucker, who in the essay "Humanities Approaches to Graphical Display" argues for a humanistic (i.e., interpretive and observer-dependent) approach to the collection and display of data. Indeed, Drucker prefers to use the term "capta" to highlight the very constructed nature of the information so often treated as "realistic" representation and yet which is actually the product of (oftentimes) troubling ideological assumptions. She goes on to offer concrete suggestions for how one might rethink the standard tools for quantifying and displaying information (e.g., bar graphs, pie charts, timelines, etc.) in ways that highlight, rather than hide, the constructed nature of data and the epistemological assumptions that bring it into being.

Drucker's call for a more imaginative and critically-minded engagement with data visualization is important and necessary, even if it provokes confusion and befuddlement at the thought of how we might tailor visualizations to reflect the variegated nature of reality and the multiple positions from which it can be viewed. (Of what use are such visualizations to those who need to share information efficiently and in ways that allow for ready comparison?) Regardless, I do think it vital that we think about what gets lost (and what exactly we gain) in the process of creating our "captasets" and turning them into material for abstract statistical and visual analysis. If, as Lev Manovich asserts, "we throw away 99% of what is specific about each object to represent only 1% in the hope of revealing patterns across this 1%," then it behooves all of us who do SNA work to consider what it is we are retaining when we abstract edges and nodes from rich social data and whether or not that 1% is meaningful as a way of rethinking the other 99%.

For my own project, I've given considerable thought to what interpretive assumptions underlie the ways in which I'm both collecting and organizing information on modernist poetic networks in Japan. Perhaps the biggest assumption (or reduction) I'm forced to make in order to work with such information is to treat all submissions to poetry journals as essentially equal in value. This is regardless of their content, style, length, original time of writing, manner of publication (i.e., was it printed or mimeographed?), place of publication, and potential for diffusion. I have thrown out, in other words, nearly all of the information that would allow us to assess these objects as written artifacts rooted in highly specialized and context-dependent fields of discursive production. All that remains is the reality of one individual having had his or her name attached to one piece of printed matter in one particular journal at a particular time.

Another large assumption I make is that the appearance of submissions in the same poetry journal constitutes a meaningful connection between the authors of those submissions. In many cases this makes good sense, as journals were often the product of small coteries of poets who banded together for the express purpose of making public their stylistically and/or ideologically similar poems and ideas. To appear in the same journal was thus a statement of allegiance to those who shared in one's aesthetic or political ideals. In other cases, however, it is easy to imagine that sharing space in a journal meant very little to the poets involved and that it may have signaled anything but a shared sensibility. Some may have treated the fact as mere happenstance (whether fortuitous or not), and we cannot rule out that some journals created space for widely divergent viewpoints. Such detail, however, is impossible to capture at scales of sufficient magnitude, and thus one has to settle for treating all simultaneous appearances as representative of a singular association between two poets, whether that association be based on collaboration, antagonism, or coincidence, and whether it be an association experienced by the poets themselves or simply recorded as archival fact.

What I have to come to terms with, then, and I'll end on this point, is that the very manner by which I'm constructing an analyzable database of information produces its own set of meanings that may or may not correspond to the meanings produced by more traditional and more linguistically attentive (or should we say micro-oriented?) methodologies. At issue here, ultimately, is the meaning of "relation" itself.

Monday, August 1, 2011

Rethinking the Database

After a meeting with our local humanities computing director, I decided that it would probably be worthwhile to begin recording the individual submission data for the poems in my database. So rather than simply count the number of instances in a particular journal, the idea would be to track every single submission as a separate and distinct edge (or link). Adopting such an approach should also make it easier to create dynamic network visualizations that represent change over time.

With this in mind, today I began to work with a very small dataset to test how the workflow might look when going from the kind of database described above (with each poem constituting a link) to any of the SNA analysis and visualization tools. I had little trouble loading the edgelist into UCINET or ORA, and both programs automatically translated the multiple links between a poet and a journal into a link weight. The problem, however, is that both programs did this by reducing the connection to a single line, which then rendered irrelevant the data for each poem (i.e., title, date). I can imagine adding this information as an edge attribute in a program like Cytoscape, but given the editing limitations, this could mean a lot of labor-intensive input. For while I can easily bring up a list of edges and add an attribute, I can't seem to sort these edges in the same way as my database. And I'm still stuck with the problem of having all submissions to one journal condensed into a single edge. Is there a way to prevent this from happening?

Update (8/26): Soon after writing this post I discovered that Cytoscape can be used to import the edgelist from a .CSV file, and that rather than condensing the edges into a single weighted line, it preserves each and every one as a distinct link. This is extremely useful for me, though the graph itself (at least for the total time span) is obviously quite messy.
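For reference, the kind of edgelist that keeps every submission distinct is simply a CSV with one row per piece. The Python sketch below reads such a file and groups the rows into yearly slices without collapsing them. The column names (`source`, `target`, `piece_id`, `year`, `month`) and the sample rows are assumptions for illustration, not a format that Cytoscape requires.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edgelist: one row per submission, with the integer
# piece ID, year, and month carried along as edge attributes.
CSV_TEXT = """source,target,piece_id,year,month
Poet A,Journal Z,1,1929,3
Poet A,Journal Z,2,1929,7
Poet B,Journal Z,3,1930,1
"""

def read_edges(text):
    """Parse the CSV text into a list of dicts, one per submission."""
    return list(csv.DictReader(io.StringIO(text)))

def slice_by_year(rows):
    """Group the rows by year, keeping every submission as its own edge."""
    slices = defaultdict(list)
    for row in rows:
        slices[int(row["year"])].append(row)
    return dict(slices)

edges = read_edges(CSV_TEXT)
slices = slice_by_year(edges)
for year in sorted(slices):
    print(year, len(slices[year]), "edges")
```

Each yearly list could then be written back out as its own CSV and loaded as a separate time slice.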