Student Members: Tremayne Booker, Jared Vitug
Faculty Advisor: Frank Witmer
Acknowledgments: ANSEP, Wikipedia
Abstract:
Wikipedia is one of the world’s largest publicly available network structures. Treating articles as nodes and the links between them as edges, our website, WikiConnect, lets people visualize and interact with this network.
Architecture:
The website has two main components, the front end and the back end, which interact by passing JSON data through Flask.
The front end is coded mostly in JavaScript. Styling was done with “bootstrap” and “fontawesome”, but most of the heavy lifting is handled by “Cytoscape”, a plugin that offers a multitude of options for representing data as networks. Our graphs use the “Cola” layout, which runs a force simulation to animate the graphs as they are generated or moved and gives control over that simulation. The graphs also require two other plugins, “Popper” and “Tippy”, to create info boxes when hovering over nodes.
The back end is written mostly in Python and handles requests for data. Whenever the front end needs to create a graph, it asks the back end for the data required, and the back end collects that data from Wikipedia’s API.
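As a rough illustration of the kind of request involved, the sketch below builds a query for the links on one article and pulls the link titles out of a response. It assumes the MediaWiki Action API's `action=query` with `prop=links`; the helper names `build_links_query` and `extract_links` are hypothetical, and the sample response is hand-written for illustration rather than data from the project.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint (assumed)


def build_links_query(title, continue_token=None):
    """Return a URL asking the MediaWiki API for the links on one article."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,   # only links that point at ordinary articles
        "pllimit": "max",   # as many links per request as the API allows
        "format": "json",
    }
    if continue_token is not None:
        params["plcontinue"] = continue_token  # resume a paged result
    return API_URL + "?" + urlencode(params)


def extract_links(response):
    """Map each page title in a parsed API response to its outgoing link titles."""
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]: [link["title"] for link in page.get("links", [])]
        for page in pages.values()
    }


# Hand-written sample in the shape the API returns (illustrative only).
sample = {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "ns": 0,
                "title": "Alaska",
                "links": [
                    {"ns": 0, "title": "Anchorage, Alaska"},
                    {"ns": 0, "title": "Juneau, Alaska"},
                ],
            }
        }
    }
}
links_by_page = extract_links(sample)
```

Articles with many links are returned in pages, so a real collector would repeat the request with the `plcontinue` value from each response until none is returned.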
The two are connected by “Flask”, a Python web framework that integrates the front and back ends. Flask exposes Python functions as URLs that the JavaScript code can request, as long as those functions return values JavaScript can read. For this, the back end sends JSON.
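The exact shape of the JSON is not described here, so the following is only a sketch of one plausible payload. It converts a mapping of articles to their links into the nodes-and-edges element format that Cytoscape accepts; the function name `to_cytoscape_elements` and the `label` field are assumptions made for illustration.

```python
import json


def to_cytoscape_elements(links_by_page):
    """Convert {article: [linked articles]} into Cytoscape-style elements."""
    # Every article that appears as a source or a target becomes a node.
    titles = set(links_by_page)
    for targets in links_by_page.values():
        titles.update(targets)

    nodes = [{"data": {"id": t, "label": t}} for t in sorted(titles)]
    # Every link becomes a directed edge from the article to its target.
    edges = [
        {"data": {"id": f"{src}->{dst}", "source": src, "target": dst}}
        for src, targets in sorted(links_by_page.items())
        for dst in targets
    ]
    return {"nodes": nodes, "edges": edges}


elements = to_cytoscape_elements({"Alaska": ["Anchorage, Alaska", "Juneau, Alaska"]})
payload = json.dumps(elements)  # the string a Flask route could return to the front end
```

On the JavaScript side, an object of this shape can be passed as the `elements` option when the graph is created.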
Data Collection:
To perform an analysis of the Wikipedia network, the entire network needed to be downloaded. Doing so was more difficult than originally planned, as Wikipedia throttles API access. Data was collected over 12 days across multiple computers. The results were stored in data frames, which were later combined to complete the network.
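The collection process can be pictured with the sketch below, which pauses between requests and then merges the partial results from several machines. It is a simplified stand-in, not the project's code: plain Python sets of edges replace the data frames, the delay value is an arbitrary assumption, and `collect_links`, `combine`, and the `fake_wiki` data are hypothetical.

```python
import time


def collect_links(titles, fetch_links, delay_seconds=1.0):
    """Fetch the links for each title, pausing between requests."""
    edges = set()
    for i, title in enumerate(titles):
        if i > 0:
            time.sleep(delay_seconds)  # stay under the API's rate limit
        for target in fetch_links(title):
            edges.add((title, target))
    return edges


def combine(*partial_edge_sets):
    """Merge edge sets gathered on different machines, dropping duplicates."""
    combined = set()
    for part in partial_edge_sets:
        combined |= part
    return combined


# A tiny in-memory stand-in for Wikipedia, so the sketch runs without network access.
fake_wiki = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
machine_1 = collect_links(["A", "B"], fake_wiki.get, delay_seconds=0)
machine_2 = collect_links(["B", "C"], fake_wiki.get, delay_seconds=0)
network = combine(machine_1, machine_2)
```

Because both machines fetched article B, the merge step has to tolerate overlap, which is why duplicates are dropped when the pieces are combined.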
Further Work:
The biggest remaining improvement is to implement a larger-than-memory model so that the entire network can be analyzed rather than only subsections. This will let us implement a “Shortest Path” finder, the framework of which can be seen in the “Pathfinder” image. Packages like “Dask” make larger-than-memory data management easier and will hopefully help make the data more manageable.
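For a sense of what a shortest path search over article links involves, here is a minimal sketch using breadth-first search on an in-memory adjacency dictionary. Breadth-first search is a standard choice for unweighted graphs, but it is an assumption here, not a description of how the Pathfinder is or will be built; the function name and the toy graph are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return a shortest list of articles from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # article -> the article it was reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor in previous:
                continue  # already reached by a path at least as short
            previous[neighbor] = current
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
path = shortest_path(toy_graph, "A", "E")
```

This version keeps the whole graph and the visited set in memory, which is exactly the limitation a larger-than-memory approach would need to remove for the full network.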