DigHumNotes

Friday, August 19, 2011

Data Entry Redux

Earlier this week, I finally completed the last batch of data entry work for the current phase of this project. This involved entering information for major poets connected with the "Shi to shiron" poetry journal, several of whom stand as the most prolific in terms of total output between 1920 and 1944. I now have nearly 50 poets in my database, with a timestamp for each piece appearing in any of the journals to which they contributed. Owing to time and manpower constraints, I chose to input only the year and month for each poem or essay, and did not record titles except as uniquely assigned integers. At a later phase of the project, it may well be worth including things like title, volume number, and page, but only if the intention is to create an interactive resource where someone might want to call upon that information and have it accessible.

Having completed the data entry, I then began loading it into various visualization programs (primarily Cytoscape and Gephi) and experimenting with the possibilities for meaningful display of the data. This included creating individual time slices for the two-mode bipartite graph (poets x journals) so as to better chart the dynamic and shifting relationships in the network over time. I'm also working on creating time slices based on one-mode affiliation graphs generated from the bipartite data. These are graphs which attempt to represent the weighted connections between individual poets vis-a-vis their participation in the various journals. Thus, for instance, if Poet A contributed 8 pieces to Journal Z, and Poet B contributed 4 pieces to that journal during the same year, then we can say that they are linked through Journal Z by the smaller of their two contribution counts, i.e., 4. Were the two poets also to contribute to another journal that year, then the minimum for that journal would be added to 4 to produce the total weight of their connection. I am not yet finished with the time slices, but my hope is that they will create an insightful picture of how the poets involved in "Gakko" and "Shi to Shiron" were related to one another over time and how their career trajectories ebbed and flowed with the fluctuations of the cultural field.
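To make the weighting concrete, here is a minimal Python sketch of the time-slice idea described above. It is not the Cytoscape or Gephi workflow itself: the record layout, the function names, and the second journal ("Journal Y") are all hypothetical, and only the Journal Z figures (8 and 4) come from the example in the text.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: one (poet, journal, year) tuple per published piece.
records = (
    [("Poet A", "Journal Z", 1930)] * 8
    + [("Poet B", "Journal Z", 1930)] * 4
    + [("Poet A", "Journal Y", 1930)] * 2
    + [("Poet B", "Journal Y", 1930)] * 3
    + [("Poet B", "Journal Z", 1931)] * 5
)


def slice_by_year(records):
    """Group pieces by year, counting contributions per (poet, journal)."""
    slices = defaultdict(Counter)
    for poet, journal, year in records:
        slices[year][(poet, journal)] += 1
    return slices


def project_poets(counts):
    """Link two poets by the sum, over shared journals, of the smaller count."""
    by_journal = defaultdict(dict)
    for (poet, journal), n in counts.items():
        by_journal[journal][poet] = n
    weights = Counter()
    for contributors in by_journal.values():
        for a, b in combinations(sorted(contributors), 2):
            weights[(a, b)] += min(contributors[a], contributors[b])
    return weights


slices = slice_by_year(records)
weights_1930 = project_poets(slices[1930])
print(weights_1930[("Poet A", "Poet B")])  # min(8, 4) + min(2, 3) = 6
```

In the 1931 slice only one poet is active, so the projection for that year contains no edges at all.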

In the course of doing this visualization work, which has often been done at considerable individual cost and questionable analytical benefit, a number of ideas have arisen as to how we might make the most effective use of the data we have and what fruitful avenues of exploration might lie ahead. One thing that has become apparent is the value of color-coding nodes based on different attributes. At present, my color scheme relies on dividing the poets according to their affiliation with either of the two journals being surveyed. Thus the "Gakko" poets are neon purple, and the "Shi to Shiron" poets are light blue. What was perhaps most amazing is that this delineation into two groups had an almost natural correspondence with the topology of the graph, which on its very own pushed the two groups of poets to either side of a mitochondrial-like ellipse. Those poets with the most shared ties between the two groups gravitated toward the center, yet without daring to cross into the other's territory. Here is an image of the entire dataset, from the years 1922-1944, and a closeup of that same image. The graphs themselves aren't that meaningful given the large time span, but it is easy to get an idea of the kind of grouping I am trying to describe here. Given that the layout algorithms are working with edge weights to determine the relative repulsion and attraction of the nodes, it is perhaps not surprising that the two groups of poets would end up like this, the more prolific poets gathered at the center and the more minor ones pushed to the fringes of their respective blocks.

What I would like to try in the weeks ahead is to color the nodes according to different attributes (place of birth, education level, place of education, number of pieces submitted) and see how these do or do not line up with the initial color coding. What might also prove interesting is to remove some of the more dominant nodes and consider how the 2nd and 3rd tier poets (at least in terms of output) are connected to one another.

Time permitting, I would also like to experiment with these avenues of analysis, interpretation, and data enrichment:

1) Create a timeline of two-mode graphs to better show the lifespan of the journals over the entire period and to give a better sense of the total number of active poets in any one year.

2) Create affiliation graphs where the journals are the nodes, instead of the poets, and see what kinds of groupings emerge in any one period or over the entirety of the time span. This data could also be geocoded so as to give a sense of the relation between Tokyo, provincial, and colonial journals. We would have to be careful, however, to contextualize all results as a product of just the 50 or so poets involved, and therefore not reflective of the total number of contributors to journals other than "Gakko" and "Shi to shiron".

3) Scan images of all the journal covers and associate these images with the appropriate nodes using the Cytoscape custom graphics manager. This might be worth saving for a later date, but I could also just do a few samples so as to provide a sense of the possibilities of this project as a fully interactive database that does not completely erase the material specificity of the objects being represented (à la Manovich's argument).
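As a rough illustration of item 1 in the list above, the following Python sketch counts how many distinct poets and journals are active in each year. The records and the function name are hypothetical, and the figures are invented for the example, not drawn from the database.

```python
from collections import defaultdict

# Hypothetical records: one (poet, journal, year) tuple per published piece.
records = [
    ("P1", "J1", 1928), ("P2", "J1", 1928), ("P1", "J2", 1928),
    ("P1", "J1", 1929), ("P3", "J3", 1929), ("P4", "J1", 1929),
]


def activity_by_year(records):
    """Map each year to (number of active poets, number of active journals)."""
    poets = defaultdict(set)
    journals = defaultdict(set)
    for poet, journal, year in records:
        poets[year].add(poet)
        journals[year].add(journal)
    return {year: (len(poets[year]), len(journals[year])) for year in sorted(poets)}


timeline = activity_by_year(records)
for year, (n_poets, n_journals) in timeline.items():
    print(year, n_poets, n_journals)
```

A journal's lifespan within the dataset could be read off the same structure by recording the first and last year in which it appears.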

Friday, August 12, 2011

Data Entry

I've spent two days re-entering data on my network of 25 poets, resulting in about 110 nodes and nearly 2,000 edges. The reason for this data-entry push has been to gain some granular clarity regarding time, so that now each submission to a poetry magazine is coded as an edge in the network. This allows me to insert things like title, year of publication, etc. as attributes of the edges, and will also allow me to create a more temporally dynamic view of poetic networks across the years 1920 to 1944. Time consuming as it is to input this information (and I've only done the bare minimum thus far, leaving out the Japanese titles, the volume and page numbers), I think it will ultimately be much more meaningful to have a year by year snapshot of the poetic field. And it is still possible to load this edgelist into a program like UCINET and have it convert the duplicate edges (i.e., a contribution made to the same journal) into weighted edges upon which I can then run some statistical analyses.
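The conversion of duplicate edges into weights can be sketched in a few lines of Python. This is only an illustration of the idea, not what UCINET does internally, and the edge list below is hypothetical.

```python
from collections import Counter

# Hypothetical edge list: one (poet, journal, year) row per submission.
edges = [
    ("P1", "J1", 1925), ("P1", "J1", 1925), ("P1", "J1", 1926),
    ("P2", "J1", 1926), ("P1", "J2", 1930),
]

# The overall weight ignores the year; the yearly weight keeps it in the key.
total_weight = Counter((poet, journal) for poet, journal, _ in edges)
yearly_weight = Counter(edges)

print(total_weight[("P1", "J1")])         # 3 submissions across all years
print(yearly_weight[("P1", "J1", 1925)])  # 2 submissions in 1925 alone
```

Keeping the year in the key is what makes the year-by-year snapshots possible while still allowing the duplicates to be collapsed for statistical analysis.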

Next week's goal will be to expand the project by inputting information for a second group of poets who are predominantly known for their affiliation with the avant-garde journal "Shi to shiron" (『詩と詩論』), which ran from 1928 to 1933, and thus overlaps with the period in which "Gakko" was being published. The goal here will be to see how closely or loosely connected are these two sets of poets, who are traditionally separated by ideological and aesthetic orientation. To what extent do the poets who contributed to these two journals participate in the same magazines over the course of the 24 year period that concerns us? Do we see clear groupings that would confirm our historical sense of their affinities (or lack of affinity) with one another? Do they converge or diverge at different points in time? Do the poets participating in each journal exhibit different network typologies? My hope is that by looking at these two groups in particular, I might be able to show the ultimate benefit of inputting the modernist poetry database in its entirety.

Thursday, August 11, 2011

Visualizing Information (1)

In response to a recent query by participants of an NEH-sponsored Network Analysis Workshop, I've begun to do some preliminary thinking about the interpretive assumptions that underlie the visualization of social networks and the process of translating historical information into quantifiable social-network "data". In recent years, a number of scholars have begun writing about the recent push in the humanities to collate and visualize "data," a movement that potentially raises all kinds of ideological problems owing to the too easy adoption of seemingly objective methodologies developed in scientific and social-scientific disciplines. Notable among these scholars is Johanna Drucker, who in the essay "Humanities Approaches to Graphical Display" argues for a humanistic (e.g., interpretive and observer-dependent) approach to the collection and display of data. Indeed, Drucker prefers to use the term "capta" to highlight the very constructed nature of the information so often treated as "realistic" representation and yet which is actually the product of (oftentimes) troubling ideological assumptions. She goes on to offer concrete suggestions for how one might rethink the standard tools for quantifying and displaying information (e.g., bar graphs, pie charts, timelines etc.) in ways that highlight, rather than hide, the constructed nature of data and the epistemological assumptions that bring it into being.

Drucker's call for a more imaginative and critically-minded engagement with data visualization is important and necessary, even if it provokes confusion and befuddlement at the thought of how we might tailor visualizations to reflect the variegated nature of reality and the multiple positions from which it can be viewed. (Of what use are such visualizations to those who need to share information efficiently and in ways that allow for ready comparison?). Regardless, I do think it vital that we think about what gets lost (and what exactly we gain) in the process of creating our "captasets" and turning them into material for abstract statistical and visual analysis. If, as Lev Manovich asserts, "we throw away 99% of what is specific about each object to represent only 1% in the hope of revealing patterns across this 1%," then it behooves all of us who do SNA work to consider what it is we are retaining when we abstract edges and nodes from rich social data and whether or not that 1% is meaningful as a way of rethinking the other 99%.

For my own project, I've given considerable thought to what interpretive assumptions underlie the ways in which I'm both collecting and organizing information on modernist poetic networks in Japan. Perhaps the biggest assumption (or reduction) I'm forced to make in order to work with such information is to treat all submissions to poetry journals as essentially equal in value. This is regardless of their content, style, length, original time of writing, manner of publication (i.e., was it printed or mimeographed?), place of publication, and potential for diffusion. I have thrown out, in other words, nearly all of the information that would allow us to assess these objects as written artifacts rooted in highly specialized and context-dependent fields of discursive production. All that remains is the reality of one individual having had his or her name attached to one piece of printed matter in one particular journal at a particular time.

Another large assumption I make is that the appearance of submissions in the same poetry journal constitutes a meaningful connection between the authors of those submissions. In many cases this makes good sense, as journals were often the product of small coteries of poets who banded together for the express purpose of making public their stylistically and/or ideologically similar poems and ideas. To appear in the same journal was thus a statement of allegiance to those who shared in one's aesthetic or political ideals. In other cases, however, it is easy to imagine that sharing space in a journal meant very little to the poets involved and that it may have signaled anything but a shared sensibility. Some may have treated the fact as mere happenstance (whether fortuitous or not), and we cannot rule out that some journals created space for widely divergent and differing viewpoints. Such detail, however, is impossible to capture at scales of sufficient magnitude, and thus one has to settle for treating all simultaneous appearances as representative of a singular association between two poets, whether that association be based on collaboration, antagonism, or coincidence, and whether it be an association experienced by the poets themselves or simply recorded as archival fact.

What I have to come to terms with, then, and I'll end on this point, is that the very manner by which I'm constructing an analyzable database of information produces its own set of meanings that may or may not correspond to the meanings produced by more traditional and more linguistically attentive (or should we say micro-oriented?) methodologies. At issue here, ultimately, is the meaning of "relation" itself.

Monday, August 1, 2011

Rethinking the Database

After a meeting with our local humanities computing director, I decided that it would probably be worthwhile to begin recording the individual submission data for the poems in my database. So rather than simply count the number of instances in a particular journal, the idea would be to track every single submission as a separate and distinct edge (or link). Adopting such an approach should also make it easier to create dynamic network visualizations that represent change over time.

With this in mind, today I began to work with a very small dataset to test how the workflow might look when going from the kind of database described above (with each poem constituting a link) to any of the SNA analysis and visualization tools. I had little trouble loading the edgelist into UCINET or ORA, and both programs automatically translated the multiple links between a poet and a journal into a link weight. The problem, however, is that both programs did this by reducing the connection to a single line, which then rendered irrelevant the data for each poem (i.e., title, date). I can imagine adding this information as an edge attribute in a program like Cytoscape, but given the editing limitations, this could mean a lot of labor-intensive input. For while I can easily bring up a list of edges and add an attribute, I can't seem to sort these edges in the same way as my database. And I'm still stuck with the problem of having all submissions to one journal condensed into a single edge. Is there a way to prevent this from happening?

Update (8/26): Soon after writing this post I discovered that Cytoscape can be used to import the edgelist from a .CSV file, and that rather than condensing the edges into a single weighted line, it preserves each and every one as a distinct link. This is extremely useful for me, though the graph itself (at least for the total time span) is obviously quite messy.
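For readers who want to see the difference between condensed and preserved edges outside of any particular tool, here is a minimal Python sketch. It does not reproduce Cytoscape's importer; the CSV layout, the column names, and the title IDs are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export; column names and title IDs are illustrative only.
csv_text = """poet,journal,title_id,year,month
P1,J1,101,1929,3
P1,J1,102,1929,7
P2,J1,103,1930,1
"""

# Every row stays a distinct edge, filed under its (poet, journal) pair.
parallel_edges = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    key = (row["poet"], row["journal"])
    parallel_edges[key].append(
        {
            "title_id": int(row["title_id"]),
            "year": int(row["year"]),
            "month": int(row["month"]),
        }
    )

# A weight can still be derived on demand without discarding per-poem data.
weights = {pair: len(items) for pair, items in parallel_edges.items()}
print(weights[("P1", "J1")])  # 2
```

The point of the design is that the weight becomes a derived value, so the title and date attached to each submission are never thrown away.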

Tuesday, July 12, 2011

From Two-Mode to One-Mode Datasets in UCINET

Despite the terribly bland title that I've given this entry, there is actually much here that will be of use in learning how to transform a bipartite affiliation graph (i.e., poet X journal) into two separate one-mode graphs that show weighted connections between poets or journals based on their original connections. To be able to transform the graph in such a way allows us to see how poets are related to one another based on their level of participation in particular journals.

To start off, I took the original data culled from the "Modernist Poetry" reference and put it into three columns: Poet, Journal, and Strength (i.e., number of contributions to that journal). Thus for each instance where a poet contributed to a specific journal, I list the poet's unique ID, the journal's unique ID, and the number of times a contribution was made (i.e., "P1 | 1 | 2" indicates that Poet 1 contributed to Journal 1 exactly two times). See "ModPoetsGakkoVal.xls" to see what this looks like.

Because the UCINET program makes it easy to transform two-mode networks into one-mode networks, the next step was to get the data into a format that could be read by the program while also ensuring that the labels and weights were input correctly. To do this, I simply needed to insert the proper commands at the head of the file ("dl n=106, format=edgelist2"); indicate that the labels were already embedded; and then copy and paste the three columns of data created in Excel. Having done this, I was able to load the data into UCINET and produce a graph like this, with weights of edges embedded as attributes:

The red circles represent the poets and the blue squares represent the journals. The weights are not visible on this image as it would make it difficult to see much of anything.
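The file preparation described above could also be scripted rather than done by copy and paste. The sketch below is a guess at the layout, not a verified UCINET recipe: the first header line is the one quoted in the post, while the "labels embedded" and "data:" lines are my assumption about how the remaining DL keywords are spelled, and the rows and the "J1"-style journal labels are hypothetical. It also assumes that labels contain no spaces.

```python
def to_dl_edgelist2(rows, n):
    """Return DL-style text for (poet, journal, strength) rows.

    The first header line reuses the command quoted in the post; the
    'labels embedded' and 'data:' lines are assumed keyword spellings.
    """
    lines = [f"dl n={n}, format=edgelist2", "labels embedded", "data:"]
    lines += [f"{poet} {journal} {strength}" for poet, journal, strength in rows]
    return "\n".join(lines) + "\n"


# Hypothetical rows: Poet 1 contributed twice to Journal 1, and so on.
rows = [("P1", "J1", 2), ("P1", "J2", 5), ("P2", "J1", 1)]
dl_text = to_dl_edgelist2(rows, n=106)
print(dl_text)
```

The resulting text would be saved to a file and checked against the UCINET documentation before import.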

Having gotten this far, I then took the UCINET file I had created and ran it through a function that split the graph into one-mode networks (see the following for instructions on how to do this). Specifically, I created a one-mode network showing only the poets and their weighted connection to one another. The weights were computed according to the sum of all minimum values shared by any two poets. Thus if P1 contributed to a journal 2 times and P5 contributed to that same journal 5 times, the weight would be resolved as 2. This value would then be computed for each instance in which the poets contributed to the same journal, and the sum of these minimums then became the weight linking the two individuals. The graph that resulted looked like this:

In other words, very messy. There's not much to be gleaned from this image alone, since it simply tells us that each poet is connected to every other poet. And this is to be expected given that our dataset was formed out of a list of individuals who all contributed at least once to the same journal. Where the graph becomes more interesting, or so I hope, is when we display the weights of the edges and start running analyses based on that information. I will begin to do this in the days ahead as I try to assess the kinds of algorithms that can be run on weighted, directed graphs, but for now I will leave you with this small tidbit.
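For reference, the weighting rule described above can be stated compactly. The notation is mine, not UCINET's: c_{ik} stands for the number of contributions made by poet i to journal k, and w_{ij} for the resulting weight between poets i and j.

```latex
w_{ij} = \sum_{k} \min\left(c_{ik},\, c_{jk}\right)
```

A journal to which only one of the two poets contributed adds min(c, 0) = 0 to the sum, so only shared journals count. In the P1 and P5 example, the journal with 2 and 5 contributions adds min(2, 5) = 2.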

Saving the above graph as a Pajek (.net) file, I loaded it into the Sci2 software so that I could use GUESS to play around with the colors and size of the edges (I find GUESS and CYTOSCAPE provide a much more intuitive interface for doing this than does NETDRAW). I then trimmed (or hid) all the edges with a weight less than 70, which I should note was a purely arbitrary choice. I'll play around with this threshold in the future. What resulted was this:

As you can see, most of the nodes are now isolated and all that remains is a core group of five nodes with two outliers connected to the core. Changing the visualization slightly to reflect the weights of the edges, we get this image:

The next step, of course, is to see who these remaining individuals are. I'll leave this for a later entry. I think it would also be useful to think some more about how the calculation of weights might or might not translate to the reality we are ultimately trying to capture. In real historical terms, what does it mean to say that the value of the connection between two poets contributing to the same journal is only as strong as the "weaker" of the two?
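The trimming step, and the follow-up question of who remains, can be sketched in Python as follows. The edge weights here are hypothetical and do not reproduce the actual graph; only the cutoff of 70 comes from the post.

```python
# Hypothetical weighted poet-to-poet edges; the real weights are not reproduced.
weights = {
    ("P1", "P2"): 120, ("P1", "P3"): 85, ("P2", "P3"): 70,
    ("P3", "P4"): 69, ("P4", "P5"): 12,
}
all_nodes = {"P1", "P2", "P3", "P4", "P5"}


def trim(weights, threshold):
    """Keep only the edges whose weight is at least the threshold."""
    return {pair: w for pair, w in weights.items() if w >= threshold}


kept = trim(weights, 70)
connected = {node for pair in kept for node in pair}
isolated = sorted(all_nodes - connected)

print(sorted(connected))  # the poets who remain in the core
print(isolated)           # the poets left without any edge
```

Rerunning `trim` with different thresholds is a quick way to see how sensitive the surviving core is to what was, after all, an arbitrary cutoff.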

Friday, June 17, 2011

Working with Sci2 and Cytoscape

Weighted Bipartite Network -- Thickness of line indicates relative number of contributions

This peculiar, fish-like graph is the result of another morning's efforts to get my initial data into a meaningful and interpretable form. Using the Sci2 Tool (developed at Indiana University for use in analyzing scientific data and co-author networks in scientific literature), I loaded a .CSV file containing my bipartite network data and edge attributes (i.e., weight = # of contributions). I then extracted a bipartite network from this file and used the Cytoscape visualization tool to begin looking at the results.

While the workflow just described might sound straightforward enough, I did run into trouble loading the edge attribute data into Sci2. Failing to find a solution, I ended up entering the data manually in the Cytoscape application, a process made somewhat easier by the fact that I could sort the edges into a list roughly similar to the list contained in my original .CSV file.

Once the data was entered, I could begin the fun part of manipulating the visualization into something that was readable. So, for example, I set the poet and journal nodes to be different shapes and colors. The yellow triangles on the outer rim represent the journals and the green squares just to the right of center represent the 25 poets who make up my initial database. I then set the thickness of the lines to reflect their weight, with the thinnest lines representing a value of 1 and the thickest representing a value of 48. I also split the journals so as to see more readily which of them had the most contributors in common. Besides 「学校」, of course, these journals included 「詩神」, 「太平洋詩人」, 「暦程」, and 「弾道」.
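The thickness mapping amounts to a linear scale from edge weight to line width. Here is a minimal sketch, assuming a simple linear interpolation; I have not checked how Cytoscape computes its own mapping, and the pixel widths of 1.0 and 12.0 are hypothetical. Only the weight range of 1 to 48 comes from the graph described above.

```python
def edge_width(weight, w_min=1, w_max=48, px_min=1.0, px_max=12.0):
    """Linearly interpolate a line width in pixels from an edge weight."""
    if not w_min <= weight <= w_max:
        raise ValueError("weight outside the mapped range")
    fraction = (weight - w_min) / (w_max - w_min)
    return px_min + fraction * (px_max - px_min)


print(edge_width(1))   # thinnest line
print(edge_width(48))  # thickest line
```

With a few very heavy edges and many light ones, a logarithmic scale might read better than a linear one, but that is a design choice to test against the actual data.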

As I did these manipulations, a couple of questions and ideas arose as to what we might be able to learn and display just at the level of visualization. Would it be interesting, for example, to show just the journals that had contributions from a majority of the poets involved? Might this reveal some of the underlying groupings in the poetic field once we compared it with poets involved in another modernist journal from the period? Could we add a temporal dimension by displaying only those journals that were in publication during some smaller unit of time (e.g., one year)? What would the graph look like if we set 「学校」 at the center, set the poets at the next level up, and then had the rest of the journals form an outer ring or a line at the top? Might this give a better sense for how weakly or strongly connected these poets were in relation to other journals? These are some of the things I would like to continue to work on next week. I think I will also return to the UCINET program to see if I can extract the poet-to-poet and journal-to-journal graphs that will hopefully prove most conducive to meaningful network analysis (i.e., measuring for betweenness, centrality, etc.).
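The first of these questions, showing only the journals with contributions from a majority of the poets, is easy to prototype. The sketch below uses hypothetical (poet, journal) pairs and takes "majority" to mean more than half of the poets in the dataset, which is my reading and not a settled definition.

```python
from collections import defaultdict

# Hypothetical (poet, journal) contribution pairs.
pairs = [
    ("P1", "J1"), ("P2", "J1"), ("P3", "J1"),
    ("P1", "J2"), ("P2", "J2"),
    ("P3", "J3"),
]


def majority_journals(pairs):
    """Return journals with contributions from more than half of the poets."""
    contributors = defaultdict(set)
    for poet, journal in pairs:
        contributors[journal].add(poet)
    n_poets = len({poet for poet, _ in pairs})
    return sorted(
        journal
        for journal, poets in contributors.items()
        if len(poets) > n_poets / 2
    )


shared = majority_journals(pairs)
print(shared)
```

Restricting `pairs` to a single year before calling the function would give the temporal version of the same filter.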

My apologies for the technical and overly detailed nature of these posts. They do not excite in the way that meta-level analysis hopefully will, but it's only by slogging through the mud that we can reach the other shore.

Thursday, June 16, 2011

Back to Work

After a long hiatus, I've finally been able to find time to get back into my SNA work. I completed my preliminary data entry last month, which involved inputting information for all poets affiliated with the poetry journal "Gakko" and the number of contributions each made to modernist poetry journals between 1920 and 1944. Today, my primary goal was to input the data into some SNA analysis tools and see if I could produce a two-mode graph in which links are weighted according to number of contributions made to a particular journal. I was able to do this quite easily in ORA, which then allowed me to produce a predictably messy graph. But what I was really keen to try out was to load the weighted data into UCINET so as to pull apart the affiliation network and see how the poets are connected to one another through the journals. This proved more difficult than I had thought, as ORA couldn't output to UCINET format. Without this step, I will likely have to input the weights manually in UCINET and go from there.

Beyond such technical details, the broader problem that I need to begin addressing is what I'm going to be able to do at the analytical stage that will reveal anything interesting. Part of the problem is learning what sort of analytical algorithms are possible and sensible given the structure of the underlying data. The other part of the problem is coming up with the right questions to ask about the data. Is it centrality measures that I am most interested in? Am I looking for clusters within the network? Do I want to create a kind of network DNA fingerprint for each poet to see how the participation of each does or does not align with others? I guess what I need to really think about are the questions that can't be answered when one has just individual data. Or even if the results do point to realities that would seem obvious to anyone familiar with the poet, the point is to look at the results in aggregate and consider what they might be able to tell us about the modern field of poetic production seen from a broader scale.