Step 0: Load the libraries

library("network")
library("sna")
library("RColorBrewer")
library("ggplot2")
library("GGally")
# Must load other packages first
library("sand")
library("intergraph")
library('scales')
library('mcclust')

PART I

Step 1: Load data

ing <- read.graph("ingred.graphml", format="graphml")
summary(ing)
## IGRAPH U-W- 1106 19679 -- 
## + attr: label (v/c), id (v/c), weight (e/n)
is.bipartite(ing)
## [1] FALSE

Step 2: Create histogram of edge weight

wts <-E(ing)$weight
min(wts)
## [1] 0.00696341
max(wts)
## [1] 1.52018
p <- ggplot(data=data.frame(x=wts), aes(x=x))
p <- p + geom_bar() 
p <- p + scale_x_continuous('Weight', breaks=seq(0,2, by=0.1))
p <- p + ggtitle('Edge distribution')

Step 3: Remove edges with weight less than 0.50

ingFiltered <- delete_edges(ing, E(ing)[(E(ing)$weight<0.5)])
summary(ingFiltered)
## IGRAPH U-W- 1106 2010 -- 
## + attr: label (v/c), id (v/c), weight (e/n)

Step 4: Distribution of weight degree less than or equal to 2.0

deg <- graph.strength(ingFiltered)

max(deg)
## [1] 23.255
p <- ggplot(data = data.frame(x=deg),aes(x=x))
p <- p + geom_histogram(breaks =seq(0,2, by=0.1), fill='azure2',col='blue')
p <- p + scale_x_continuous('Weighted degree', breaks=seq(0,2, by=0.1))
p <- p + ggtitle('Distribution for weighted degrees less than or equal to 2.0')
print(p)

Step5: Remove singleton nodes.

deg <- degree(ingFiltered, normalized = T)
j = 1
for (i in V(ingFiltered)) {
  ingFiltered <- set.vertex.attribute(ingFiltered,'degree',i,deg[j])    
  j = j + 1
}
ingFiltered <- delete_vertices(ingFiltered,V(ingFiltered)[V(ingFiltered)$degree==0])

labels <- V(ingFiltered)$label
deg <- degree(ingFiltered, normalized = T)
min(deg)
## [1] 0.001122334

PART II: Community detection

Step 1: Algorithms

Leading eigenvector community detection find 44 groups with a modularity of 0.7

ComEigen <- cluster_leading_eigen(ingFiltered,weights = E(ingFiltered)$weight)
ComEigen
## IGRAPH clustering leading eigenvector, groups: 44, mod: 0.7
## + groups:
##   $`1`
##    [1]   1   2   3   4   5   6   9  10  12  22  38  39  45  46  56  66  69
##   [18]  74  76  77  79  82  86  90 139 143 148 152 155 167 179 185 189 195
##   [35] 245 246 248 287 313 404 405 407 408 410 440 449 475 491 500 508 526
##   [52] 571 572 574 593 596 597 598 599 602 605 608 620 621 622 626 627 628
##   [69] 641 684 685 686 727 728 730 732 736 741 743 744 762 789 798 805 819
##   [86] 845 847 848 850 859 867 882 890
##   
##   $`2`
##   [1]  54 808
##   + ... omitted several groups/vertices
ComEigen$modularity
## [1] 0.6985998

By Fastgreedy there are 25 groups with a modularity of 0.76

ComFG <- cluster_fast_greedy(ingFiltered)
ComFG
## IGRAPH clustering fast greedy, groups: 25, mod: 0.76
## + groups:
##   $`1`
##     [1]  20  24  80  93  94  95  96  97  99 100 101 106 107 113 114 118 120
##    [18] 121 124 126 127 128 129 133 144 192 205 212 229 230 232 254 255 258
##    [35] 259 260 261 262 263 268 269 270 274 282 284 288 327 350 362 368 379
##    [52] 387 392 413 414 415 422 423 426 427 430 432 434 435 437 438 439 444
##    [69] 445 446 456 457 461 462 470 472 479 483 489 493 495 496 501 502 505
##    [86] 506 584 615 678 679 733 745 765 782 813 825 829 839 861 865 870 876
##   
##   $`2`
##    [1]  21  28  44  46  47  74 104 109 111 130 132 138 140 142 151 160 163
##   + ... omitted several groups/vertices

By Edge betweenness we find 28 groups wih a modularity of 0.74

ComBet <- cluster_edge_betweenness(ingFiltered)
ComBet
## IGRAPH clustering edge betweenness, groups: 28, mod: 0.74
## + groups:
##   $`1`
##    [1]   1   2   3   4   6   9  22  28  39  40  45  66  69  72  74  76  77
##   [18]  90 139 143 148 151 155 195 245 248 404 405 410 431 440 449 455 475
##   [35] 594 596 598 605 620 622 641 680 684 685 686 700 727 728 730 732 737
##   [52] 738 741 743 762 789 792 806 845 848
##   
##   $`2`
##    [1]   5  10  11  14  19  26  30  83 112 141 146 150 153 156 157 161 174
##   [18] 178 194 198 256 264 266 386 576 595 703 734 784 785 793 851 869
##   
##   + ... omitted several groups/vertices

By Walktrap with the number of steps equal to 4 there are 92 groups with a modularity 0f 0.69

ComW4 <- cluster_walktrap(ingFiltered,steps = 4)
ComW4
## IGRAPH clustering walktrap, groups: 92, mod: 0.69
## + groups:
##   $`1`
##   [1] 115 297 337 585 771 776 879
##   
##   $`2`
##    [1]  75  91 271 272 273 296 302 323 329 331 336 351 358 375 487 512 514
##   [18] 516 521 522 523 578 603 632 695 767 774 775 820 852
##   
##   $`3`
##    [1]   8  12  17  27  32  36  42  59  67  71  82  84  88  89  90  92 145
##   [18] 158 200 215 305 321 467 468 469 477 499 507 527 601 624 631 639 666
##   + ... omitted several groups/vertices

By Walktrap with the number of steps equal to 10 there are 53 groups with a modularity 0f 0.73

ComW10 <- cluster_walktrap(ingFiltered,steps = 10)
ComW10
## IGRAPH clustering walktrap, groups: 53, mod: 0.73
## + groups:
##   $`1`
##    [1]  10  11  14  19  30  83 112 141 146 150 153 156 157 161 178 184 194
##   [18] 198 264 386 576 595 703 734 784 785 793 851
##   
##   $`2`
##   [1] 108 326 365 682 683
##   
##   $`3`
##    [1]  53 103 122 125 206 213 221 279 280 281 283 285 292 294 300 303 306
##   [18] 308 309 310 314 317 318 320 348 349 351 366 377 442 448 454 471 497
##   + ... omitted several groups/vertices

Step 2: Comments

In terms of the number of groups only two algorithms show similar numbers of groups, betweenness and fastgreedy. Betweenness is focuse on finding conections between different communities. In contrast the fastgreedy algorithm that identify once modularity is not increasing. Nodes with high betweenness showld not contribut positevely to the fastgreedy algorithm and are the key to identify communities in the firts algorithm. This algorithms seem to be counterparts, providing similar results in terms of modularity and number of groups. In addition this algorithm have the highest modularity.

Modularity among the algorithms similar, apparently the communities found have significan connection among the vertices.

Step 4:

ei.FG <- vi.dist(ComEigen$membership,ComFG$membership)
ei.Be <- vi.dist(ComEigen$membership,ComBet$membership)
ei.W1 <- vi.dist(ComEigen$membership,ComW10$membership)
FG.Be <- vi.dist(ComFG$membership,ComBet$membership)
FG.W1 <- vi.dist(ComFG$membership,ComW10$membership)
Be.W1 <- vi.dist(ComBet$membership,ComW10$membership)

distance <- data.frame(ei.FG,ei.Be,ei.W1,FG.Be,FG.W1,Be.W1)
y <- as.vector(c(ei.FG,ei.Be,ei.W1,FG.Be,FG.W1,Be.W1))
x = as.vector(c('ei-FG','ei-Be','ei-W10','FG-be','FG-W10','Be-W10'))
l = as.vector(c('olivedrab2','olivedrab2','olivedrab1','olivedrab3','olivedrab4','olivedrab3'))
l = as.vector(c('olivedrab4','olivedrab3','olivedrab4','olivedrab2','olivedrab1','olivedrab2'))

barplot(y,names.arg = x,main = 'Variation of Information Distance', ylab = 'Variation', col=l)

Step 5 Question: Fast Greedy has the highest modularity, which is not surprising since it directly optimizes that. Which clustering algorithm produces the most similar clustering?

Answere: Two other clustering have similar high modularities, betweenness 0.74 and Walktrap with 10 steps 0.73.

PAART III

Step 1: Ingredient

#labels
avocado = which(V(ingFiltered)$label=='cheese ravioli')

Step 2: Creating subgraphs for ingredient

I started with avocado, but the network are large. So I finally choose another Cheese ravioli.

for (i in 1:25){
  ingridient = which(ComFG$membership==i)
  if (sum(ingridient == avocado)>0){
    i
  }
}

for (i in 1:53){
  ingridient = which(ComW10$membership==i)
  if (sum(ingridient == avocado)>0){
    i
  }
}


avocadoFG = which(ComFG$membership==2)
avocadoNet <- induced.subgraph(ingFiltered,avocadoFG)
summary(V(avocadoNet)$label)
##    the 10 most common values are:
##       1% buttermilk alfredo pasta sauce       alfredo sauce 
##                   1                   1                   1 
##          applesauce        baking cocoa         basil sauce 
##                   1                   1                   1 
##      black eyed pea       bow tie pasta         bread dough 
##                   1                   1                   1 
##   butter shortening 
##                   1
avocadow10 = which(ComW10$membership==23)
avocadoNetW <- induced.subgraph(ingFiltered,avocadow10)
summary(V(avocadoNetW)$label)
##    the 10 most common values are:
## alfredo pasta sauce       alfredo sauce            asparagu 
##                   1                   1                   1 
##         basil sauce      black eyed pea       bow tie pasta 
##                   1                   1                   1 
##         bread dough      cheese ravioli curd cottage cheese 
##                   1                   1                   1 
##            eggplant 
##                   1

Step 3: Network plots

layout <- layout_with_kk(avocadoNet)
size <- degree(avocadoNet, mode = 'total',normalized = T) * 100
palet <- brewer.pal(9,'BrBG')
ramp <- colorRamp(palet, space = 'Lab', interpolate = 'linear')
bet <- degree(avocadoNet, mode = 'total' ,normalized = T)
bet.scale <- rescale(bet)
colors <- ramp(bet.scale)
rgbs <- rgb(colors, maxColorValue = 255)

plot(avocadoNet, layout=layout, vertex.size=size, edge.arrow.size=.3,vertex.color=rgbs, main = 'Avocado Network Fastgreedy')#vertex.label=NA

layout <- layout_with_kk(avocadoNetW)
size <- degree(avocadoNetW, mode = 'total',normalized = T) * 90
palet <- brewer.pal(9,'BrBG')
ramp <- colorRamp(palet, space = 'Lab', interpolate = 'linear')
bet <- degree(avocadoNetW, mode = 'total' ,normalized = T)
bet.scale <- rescale(bet)
colors <- ramp(bet.scale)
rgbs <- rgb(colors, maxColorValue = 255)
plot(avocadoNetW, layout=layout, vertex.size=size, edge.arrow.size=.3,vertex.color=rgbs, main = 'Avocado Network Walktrap ')#vertex.label=NA

Step 4: Save the network in GraphML format

write.graph(avocadoNet,'cheeseRavioliFG.graphml', format = c('graphml'))
write.graph(avocadoNetW,'cheeseRavioliW.graphml', format = c('graphml'))

STEP 5-6 Gephi Visualizations

Step7: Questions

Question: What type of food(s) do the clusters represent? The first visualizaiton belong to the community identify with the Fastgreedy algorithm. This community has two local networks or types of food. One that belongs to the ingredient I choose, which is cheese ravioli. This local network is related to Italian cousine. The second local network is food or ingredients derived from milk. There are two nodes within the Italian food that have strong edges to skim milk and egg substitute. This is the reason why this local networks are together in this community detected by Fastgreedy.

The second visualization is all about Italian food. There are no evident local networks whithin it.

There are tow main things that came as a surprise to see. Firts the ingredient had only one edge to spauetti sauce. I would have expected to see a edges to types of cheese, spinach which are common in ravioli. The second thing is that even when I selected a node with low degree the communi detection delivers all vertex related and so I was able to see its network.

The following visualizations are focus on the choosen ingredient in both communities.

Question: Which cluster captures the associated cuisine better? Why? (They could be equally good.)

The Fastgreedy algorithm objective is to maximize modularity. Since there are nodes with strong ties between Italian cousine and egg sustitues and skim milk the algorithm was able to keep improving modularity without recognizing that it had gone to another local network. On the other hand, walktrap tends to stay within same comunities. The result is that the Italian cousine is identify as a single community.