Overview

Computational design is my major. Because it spans such broad fields, I feel I still do not know enough about it, even though it is my major. Computational design combines multiple concepts across design and computation, redefining innovative computer-aided design methods through digital, physical, and interactive computing.

As Daniel Cardoso Llach observes, the "computer revolution" significantly affected design theories and practices through cybernetics, artificial intelligence, and the transformations linked to them. Computational design has contributed many significant design methods, tools, and concepts to our conceptions of design, creativity, nature, body, and place.

Through computational visualization and analytics over several conferences in the computational design field, including ACADIA, CAADRIA, and other CAAD venues, this project aims to find the core research topics of computational design.

In this project, I built a complete data pipeline: scraping data from the web, cleaning it into a structured database, and exploring it with programming libraries. My main contribution is to offer some new insights into the field of computational design.

Based on all the above, I propose two questions:
  • What have been the important topics and keywords of computational design over the past 60 years?
  • How have significant authors and institutions cooperated and influenced each other, for example through citation or co-authorship?

Data Overview

Accountability: About This Data Set

The data comes from CumInCAD.

CumInCAD is a website that hosts a vast library of CAAD scholarship. It has indexed and shared the proceedings of almost all major CAAD conferences for decades. As of March 2015, it held 14016 records from journals and conferences such as ACADIA, ASCAAD, CAADRIA, and others. Exploring this dataset will help answer my questions. The data used in the project covers all publications stored in CumInCAD.

Data Collection

Data from CumInCAD can only be accessed in HTML format. Fortunately, because the HTML on the CumInCAD website is strongly structured, I could fetch the data from the web pages and save it as CSV. In this project, I saved several useful attributes. The database structure I designed was:
TABLE `papers` (
  `id` integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  `search` varchar(256) DEFAULT NULL,
  `authors` varchar(1024) DEFAULT NULL,
  `year` integer DEFAULT NULL,
  `title` varchar(1024) DEFAULT NULL,
  `source` varchar(1024) DEFAULT NULL,
  `summary` text COLLATE BINARY,
  `keywords` varchar(1024) DEFAULT NULL,
  `series` varchar(1024) DEFAULT NULL,
  `content` varchar(256) DEFAULT NULL,
  `url` varchar(1024) DEFAULT NULL,
  `email` varchar(1024) DEFAULT NULL
);
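The schema above can be created and filled programmatically. The following is a minimal sketch using Python's built-in sqlite3 module; the function names and the dict-based record format are illustrative assumptions, not the project's actual script:

```python
import sqlite3

# The `papers` table from the schema above, as a CREATE statement.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id       INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    search   VARCHAR(256)  DEFAULT NULL,
    authors  VARCHAR(1024) DEFAULT NULL,
    year     INTEGER       DEFAULT NULL,
    title    VARCHAR(1024) DEFAULT NULL,
    source   VARCHAR(1024) DEFAULT NULL,
    summary  TEXT,
    keywords VARCHAR(1024) DEFAULT NULL,
    series   VARCHAR(1024) DEFAULT NULL,
    content  VARCHAR(256)  DEFAULT NULL,
    url      VARCHAR(1024) DEFAULT NULL,
    email    VARCHAR(1024) DEFAULT NULL
)
"""

def init_db(path=":memory:"):
    """Open (or create) the SQLite database and ensure the papers table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def insert_paper(conn, record):
    """Insert one scraped record, given as a dict of column -> value."""
    cols = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(f"INSERT INTO papers ({cols}) VALUES ({placeholders})",
                 list(record.values()))
    conn.commit()
```

SQLite is a convenient target here because the scraped CSV rows can be loaded incrementally and queried later without a database server.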

Data Completeness

Because the data was scraped by my own scripts rather than obtained from a curated public dataset, it contains some missing values. Overall, I rate the dataset as having mid-level completeness.

The diagram on the left shows the missing data in the dataset. There are six primary types of missing data:
1. Missing Content
2. Missing Keywords
3. Missing Items
4. Missing Series
5. Missing Source
6. Missing Summary

Data Correctness

Within the scope of my research, I consider the data correct, for three reasons.

First, it is supported by official organizations. CumInCAD is a cumulative index of publications in computer-aided architectural design, supported by the sibling associations ACADIA, CAADRIA, eCAADe, SIGraDi, ASCAAD, and CAAD Futures. My advisor, who researches computational design, also recommended CumInCAD to me.

Second, it is public and open-access, so it is verifiable. From 2016 on, the repository was directed towards Open Access and, on top of this, fully relaunched by Prof. Tomo Cerovsek of the University of Ljubljana (Slovenia).

Third, every record on CumInCAD links to the corresponding PDF of the publication. I spot-checked records to confirm that the author, title, keywords, and other fields in the PDF match the database. I even searched for some authors to verify that they had indeed published the papers.

Data Coherence

The data is internally consistent, matches my expectations, and its distributions are sensible.

1. The database I scraped has 13990 records. All fields are stored as strings.

2. All values are coherent with their attributes and types. Years range from 1954 to 2018, and the distribution over years is reasonable.

3. All values of "authors" consist of first and last names. When cleaning the data, I split the authors field into separate columns, one author per column, so that I can examine the collaborative relations between authors.

Data Cleaning

Because I want to see co-authorship across thousands of authors, I need to clean the database and redesign it with authorship tables.

The "authors" attribute is not well structured: the separator between author names can be ";", ",", or "and", but sometimes "," instead separates a last name from a first name.

So I split the author list on semicolons, commas, and the word "and", but skip a split when the resulting token is too short, to avoid over-splitting names.
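This splitting heuristic can be sketched as follows. The minimum-token-length threshold here is my own assumption for illustration, not necessarily the value used in the project:

```python
import re

def split_authors(raw, min_len=3):
    """Split an author string on ';', ',' and the word 'and'.

    Heuristic sketch: if a comma split would produce a very short token
    (likely a first-name fragment such as 'J.' in 'Smith, J.'), keep that
    comma instead of treating it as a separator. The min_len threshold
    is an illustrative assumption.
    """
    # First split on the less ambiguous separators: ';' and the word "and".
    parts = re.split(r";|\band\b", raw)
    authors = []
    for part in parts:
        tokens = [t.strip() for t in part.split(",") if t.strip()]
        merged = []
        for tok in tokens:
            # A very short token is probably an initial, not a new author:
            # merge it back into the preceding name.
            if merged and len(tok) <= min_len:
                merged[-1] = merged[-1] + ", " + tok
            else:
                merged.append(tok)
        authors.extend(merged)
    return [a.strip() for a in authors if a.strip()]
```

For example, `split_authors("Smith, J.; Doe, A. and Brown, Al")` keeps the commas inside "Smith, J." while still splitting the three authors apart. Full first names after a comma remain ambiguous, which is one reason the author table needs further deduplication later.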

The cleaned dataset is saved in three tables:
Author - author id and author name
Authorship - author id and paper id
Papers - paper id and the other paper attributes
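The normalization into the Author and Authorship tables can be sketched like this. It is a simplified illustration: identical name strings are treated as the same author, which is exactly why deduplication against an external author list is still needed afterwards:

```python
def build_author_tables(papers):
    """Normalize (paper_id, [author names]) pairs into two tables.

    Returns:
      authors    - dict mapping author name -> author id (Author table)
      authorship - list of (author_id, paper_id) rows (Authorship table)

    Simplified sketch: authors are keyed by exact name string only.
    """
    authors = {}      # Author table: name -> author_id
    authorship = []   # Authorship table: (author_id, paper_id) rows
    for paper_id, names in papers:
        for name in names:
            if name not in authors:
                authors[name] = len(authors) + 1
            authorship.append((authors[name], paper_id))
    return authors, authorship
```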

Even then, the author table contained many duplicate authors, so I re-scraped author data from another website, https://cumincad.architexturez.net/, to obtain a complete and accurate author table.

The data is now available as a Google Fusion Table.

Data Analysis

First, to visualize the data clearly at a high level and to see the trend in publication counts over the past 60 years, the first analysis is organized along a timeline of years. The following interactive curve chart lets users explore the number of publications per conference.
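The per-conference, per-year counts behind such a curve chart can be computed with a simple aggregation. This is a sketch; the (year, series) record format is my assumption about how the cleaned rows would be fed in:

```python
from collections import Counter

def publications_per_year(records):
    """records: iterable of (year, series) pairs, e.g. (2004, "CAADRIA").

    Returns {series: Counter({year: count})}, i.e. one curve per
    conference series, ready to plot along a timeline.
    """
    counts = {}
    for year, series in records:
        counts.setdefault(series, Counter())[year] += 1
    return counts
```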

Data Exploration

The following explorations address the questions I proposed earlier. Because the database contains 10105 authors and their authorship records, I want users to be able to explore co-authorship in a large interactive force-directed graph.

And from the large bank of publication summaries, I want to extract the core topics and keywords of the field.

Co-Author Network

One of my targets is to uncover co-authorship patterns in computational design.

There are 10105 authors and 23870 authorships (author-to-paper links) in the database. Visualizing this huge network was the biggest challenge of the project. I used sigma.js, a framework focused on graph visualization, to program it.

The following graph is the outcome. You may have to wait a few seconds for the initial loading.

In the network, each node is an author, and an edge between two nodes means they have collaborated in the past. The closer two nodes are, the more collaborations they share. The color and size of a node encode the author's number of co-authorships: the larger the node and the darker the color, the more collaborations the author has.
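The node and edge attributes described above can be derived from the authorship data before handing them to the renderer. The following is a sketch, with the output shaped as plain dicts in a sigma.js-style format (the exact attribute names sigma.js expects depend on its version, so these are illustrative):

```python
from collections import Counter
from itertools import combinations

def coauthor_graph(paper_authors):
    """paper_authors: list of author-id lists, one list per paper.

    Returns (nodes, edges): edge weight counts shared papers, and node
    size grows with the author's total number of co-authorships.
    """
    edge_weight = Counter()    # (a, b) -> number of shared papers
    collab_count = Counter()   # author -> total co-authorships
    for authors in paper_authors:
        for a, b in combinations(sorted(set(authors)), 2):
            edge_weight[(a, b)] += 1
            collab_count[a] += 1
            collab_count[b] += 1
    nodes = [{"id": a, "size": 1 + collab_count[a]} for a in collab_count]
    edges = [{"source": a, "target": b, "weight": w}
             for (a, b), w in edge_weight.items()]
    return nodes, edges
```

Precomputing weights this way keeps the browser-side work down to layout and rendering, which matters at the scale of 10,000+ nodes.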

In this section, users can zoom in and out to switch between details and overview. Clicking a node selects that author's network; clicking the blank area resets the view.

Classification

I use LDA, a topic-modeling method, to derive 20 topics from the publication summaries (abstracts).

The model was configured with 250 features to classify the summaries into 20 topics, each described by 5 words.
To avoid duplicate singular and plural forms of the same word, and to remove uninformative words such as "job" or "university", I filtered the vocabulary.
Finally, I highlighted the words that I think best represent each topic.

Classifying an Abstract

Based on the unsupervised machine-learning model and the 20 generated topics, the model can predict a paper's topic from its abstract or summary.

Please paste the abstract in the following box to classify the topic.
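Behind the textbox, prediction amounts to transforming the pasted abstract with the fitted vectorizer and picking the most probable topic. A sketch, assuming the `vectorizer` and `lda` objects were fitted on the corpus of summaries as in the topic-modeling step (names and signature are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def classify_abstract(abstract, vectorizer, lda):
    """Return the index of the most probable topic for one abstract.

    Assumes `vectorizer` (CountVectorizer) and `lda`
    (LatentDirichletAllocation) have already been fitted.
    """
    doc_term = vectorizer.transform([abstract])
    topic_dist = lda.transform(doc_term)[0]  # per-topic probabilities
    return int(topic_dist.argmax())
```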

Summary

In this project, the data exploration produced good answers to the initial questions about author collaboration in computational design and the core research topics of the field. The main challenges lay in data collection, data cleaning, and visualizing the co-author network for such a large dataset. The risks were that, with no public structured dataset available, it is hard to identify unique author names, and that the co-author visualization of over 10,000 nodes needed careful optimization. In the end, these problems were resolved with appropriate methods.

From the project, I learned and applied an entire data pipeline: collecting data from a website, cleaning it into a structured, well-designed database, and exploring it with programmatic data-visualization techniques to answer the proposed questions. I also learned how to build a user-friendly website and visualization application, and found suitable interaction techniques for presenting a graph even at this scale. I am glad to have applied machine-learning methods to a real data project.

This is only the first phase of answering my thesis questions. I will continue researching new insights into the computational design field using technical approaches and interdisciplinary knowledge such as bibliometrics and social network analysis. As a first prototype, it is a successful attempt to explore the concepts and verify potential methods, but it is not the end. The project only includes CumInCAD data, while computational design spans more disciplines that are still waiting to be explored.

As a next step, I will try to fetch data for other related conferences from other digital libraries. If possible, one of my targets is to study the distribution of researchers by gender and location, even though such author attributes are difficult to obtain. The representation of the large dataset will remain the most important challenge.