Computational design is my major, but the field is so broad that I feel I still do not know it well enough.
Computational design spans many concepts and fields at the intersection of design and computation, redefining computer-aided design methods through digital, physical, and interactive computing.
The “Computer revolution” significantly affects design theories and practices, due to cybernetics, Artificial Intelligence, and the linked transformations. Computational design discovered many significant design ways, tools, and concepts on our conceptions of design, creativity, nature, body, and place. —— Daniel Cardoso Llach
Using visual analytics of the proceedings of several computational design conferences, including ACADIA, CAADRIA, CAAD etc., this project aims to identify the core research topics in computational design.
In this project, I build a data pipeline that scrapes data from the web, cleans it, stores it in a structured database, and explores it with programming libraries. My main contribution is to offer some new insights into the field of computational design.
Based on all the above, I propose two questions:
What have been the important topics or keywords of computational design over the past 60 years?
How have significant authors and institutions collaborated with and influenced each other, for example through citation or co-authorship?
Accountability: About This Data Set
The data comes from CumInCAD.
CumInCAD is a website that hosts a vast library of CAAD scholarship. It has long indexed and shared the proceedings of almost all major CAD conferences. As of March 2015, it held 14016 records from journals and conferences such as ACADIA, ASCAAD, CAADRIA and …. Exploring this dataset will help answer the questions above. The data used in the project covers all publications stored in CumInCAD.
CumInCAD data can only be accessed as HTML. Fortunately, because the HTML on the CumInCAD website is strongly structured, I could fetch the data from its web pages and save it as CSV. For this project, I kept several useful attributes. The database structure I designed is:
CREATE TABLE `papers` (
    `id` integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    `search` varchar ( 256 ) DEFAULT NULL,
    `authors` varchar ( 1024 ) DEFAULT NULL,
    `year` integer DEFAULT NULL,
    `title` varchar ( 1024 ) DEFAULT NULL,
    `source` varchar ( 1024 ) DEFAULT NULL,
    `summary` text COLLATE BINARY,
    `keywords` varchar ( 1024 ) DEFAULT NULL,
    `series` varchar ( 1024 ) DEFAULT NULL,
    `content` varchar ( 256 ) DEFAULT NULL,
    `url` varchar ( 1024 ) DEFAULT NULL,
    `email` varchar ( 1024 ) DEFAULT NULL
);
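The step from scraped CSV to the database could look roughly like the sketch below. It is not the project's actual loader: it assumes the CSV header uses the same names as the table columns, keeps only a subset of those columns for brevity, and uses two made-up sample rows (with example.org URLs) in place of the real file.

```python
import csv
import io
import sqlite3

# Subset of the `papers` columns above; `id` is filled in by SQLite.
COLUMNS = ["authors", "year", "title", "source", "summary", "keywords", "series", "url"]


def load_papers(csv_file, conn):
    """Create a cut-down `papers` table and insert every CSV row into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "authors TEXT, year INTEGER, title TEXT, source TEXT, "
        "summary TEXT, keywords TEXT, series TEXT, url TEXT)"
    )
    year_index = COLUMNS.index("year")
    rows = []
    for record in csv.DictReader(csv_file):
        # Empty or missing fields become NULL rather than empty strings.
        values = [(record.get(col) or "").strip() or None for col in COLUMNS]
        # Store the year as an integer when it parses, otherwise as NULL.
        year = values[year_index]
        values[year_index] = int(year) if year and year.isdigit() else None
        rows.append(values)
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO papers ({', '.join(COLUMNS)}) VALUES ({placeholders})", rows
    )
    conn.commit()
    return len(rows)


# Hypothetical sample standing in for the scraped CSV file.
SAMPLE_CSV = """authors,year,title,source,summary,keywords,series,url
"Doe, Jane; Roe, Richard",1998,A Sample Paper,Sample Proceedings,An abstract.,shape grammar,ACADIA,http://example.org/1
"Doe, Jane",n.d.,Another Paper,Sample Proceedings,Another abstract.,,CAADRIA,http://example.org/2
"""

conn = sqlite3.connect(":memory:")
inserted = load_papers(io.StringIO(SAMPLE_CSV), conn)
print(inserted, conn.execute("SELECT title, year FROM papers ORDER BY id").fetchall())
```

Turning unparseable years and empty fields into NULL at load time keeps later queries, such as counting papers per year, free of special cases.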
The following exploration addresses the questions proposed above. Because the database contains 10105 authors and the papers they wrote, I want users to be able to explore co-authorship in a large interactive force-directed graph.
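As a rough illustration of how co-authorship links can be derived from the `authors` column, the sketch below counts how often each pair of authors appears on the same paper. The ";" separator, the sample names, and the `source`/`target`/`value` link shape are assumptions made for the example, not details taken from the project.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(author_fields, separator=";"):
    """Count how many papers each pair of authors wrote together."""
    edges = Counter()
    for field in author_fields:
        # De-duplicate and sort so that (A, B) and (B, A) map to one edge.
        names = sorted({name.strip() for name in field.split(separator) if name.strip()})
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges


# Hypothetical `authors` values, one string per paper.
papers = [
    "Doe, Jane; Roe, Richard",
    "Roe, Richard; Doe, Jane; Poe, Alex",
    "Poe, Alex",
]
edges = coauthor_edges(papers)
# Node-link records of the kind force-layout tools commonly consume.
links = [{"source": a, "target": b, "value": n} for (a, b), n in sorted(edges.items())]
print(links)
```

Single-author papers add no edges, and the pair count can serve as the link weight in the graph.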
From the large collection of publication summaries, I also want to extract the core topics and keywords of the field.
I use LDA (Latent Dirichlet Allocation), a topic modeling method, to extract 20 topics from the publication summaries (abstracts).
The model was configured with 250 features and groups the summaries into 20 topics, each represented by 5 words.
To avoid counting singular and plural forms of the same word separately, and to remove uninformative words such as "job" or "university", I filtered the vocabulary.
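A minimal sketch of this kind of filtering is shown below. The plural rule is deliberately crude and is not a real lemmatizer, the filter list beyond "job" and "university" is invented for the example, and the 250-word cap mirrors the feature count mentioned above; the project's actual preprocessing may differ.

```python
import re
from collections import Counter

# Hypothetical filter list; the text above only names "job" and "university".
FILTERED_WORDS = {"job", "university", "the", "a", "of", "and", "in", "for", "at"}


def normalize(word):
    """Fold simple plurals into the singular (a crude rule, not a lemmatizer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "is", "us")) and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text):
    """Lowercase, split into words, merge plurals, and drop filtered words."""
    words = (normalize(w) for w in re.findall(r"[a-z]+", text.lower()))
    return [w for w in words if w not in FILTERED_WORDS]


def build_vocabulary(summaries, max_features=250):
    """Keep the max_features most frequent words across all summaries."""
    counts = Counter(w for s in summaries for w in tokenize(s))
    return [w for w, _ in counts.most_common(max_features)]


# Made-up summaries for illustration.
summaries = [
    "Parametric models for design studios at the university",
    "A parametric model of design strategies",
    "Design strategy and jobs in universities",
]
print(build_vocabulary(summaries, max_features=4))
```

Normalizing before filtering means that "jobs" and "universities" are removed along with their singular forms.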
Finally, I highlighted the words that I think best represent each topic.
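To make the topic extraction concrete, here is a small self-contained LDA sketch. The text does not say which LDA implementation the project used; collapsed Gibbs sampling is swapped in here because it fits in a few lines of plain Python. The `alpha` and `beta` priors, the iteration count, and the toy documents are assumptions, and the toy run uses 2 topics rather than 20.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (vocab, topic_word), where topic_word[k][w] counts how often
    vocabulary word w is currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(n_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[word]] += 1
            topic_total[k] += 1
        assignments.append(topics)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w, k = index[word], assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight = (topic share in this doc) * (word share in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return vocab, topic_word


def top_words(vocab, topic_word, n_top=5):
    """List the n_top highest-count words of each topic."""
    return [
        [vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:n_top]]
        for row in topic_word
    ]


# Toy tokenized summaries; real input would be the filtered abstracts.
docs = [
    "shape grammar rule shape generation".split(),
    "grammar rule generation shape rule".split(),
    "robot fabrication material robot timber".split(),
    "fabrication timber material robot fabrication".split(),
]
vocab, topic_word = fit_lda(docs, n_topics=2)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(k, words)
```

With the full corpus, `n_topics=20` and `n_top=5` would match the settings described above, and the printed word lists are the candidates from which representative words can be highlighted.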
Using the unsupervised model and the 20 topics it generated, a paper can be assigned a topic from its abstract or summary. The page provides a text box where you can paste an abstract to classify its topic.
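The classification step can be pictured with the simplified sketch below. It is not full LDA inference: it just picks the topic whose word probabilities make the abstract's known words most likely. The topic names and probabilities are hypothetical placeholders for what a fitted model would supply, and `floor` is an assumed smoothing value for words a topic does not contain.

```python
import math
import re

# Hypothetical topic-word probabilities; a fitted model would supply these.
TOPICS = {
    "fabrication": {"robot": 0.4, "fabrication": 0.4, "material": 0.2},
    "generative design": {"shape": 0.4, "grammar": 0.3, "rule": 0.3},
}


def classify(abstract, topics=TOPICS, floor=1e-6):
    """Return the topic under which the abstract's known words are most likely."""
    known = set().union(*topics.values())
    words = [w for w in re.findall(r"[a-z]+", abstract.lower()) if w in known]
    if not words:
        return None  # Nothing in the abstract matches the model vocabulary.
    scores = {
        name: sum(math.log(probs.get(w, floor)) for w in words)
        for name, probs in topics.items()
    }
    return max(scores, key=scores.get)


print(classify("A robot arm for the fabrication of timber material"))
print(classify("Shape grammar rules for plan generation"))
```

Summing log probabilities avoids numeric underflow on long abstracts, and returning `None` makes it explicit when an abstract shares no words with the model.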
In this project, the data exploration answers the initial questions well, covering author collaboration in computational design and the core research topics of the field. The challenges lay in data collection, data cleaning, and visualizing the co-author network for such a large dataset. The risks were that no public structured dataset exists, that unique author names are hard to identify, and that a co-author visualization with over 10,000 nodes is hard to optimize. In the end, these problems were resolved with appropriate methods.
From this project, I learned and applied an entire data pipeline: collecting data from a website, cleaning it into a structured, well-designed database, and exploring it through programmatic data visualization to answer the proposed questions. I also learned how to build a user-friendly website and visualization application, and I found suitable interactive ways to present a graph even for a large dataset. I am glad to have applied machine learning methods to a real data project.
This is only the first phase of answering my thesis questions. I will continue looking for new insights into computational design using technical methods and interdisciplinary knowledge such as bibliography and social networks. As a first prototype, it is a successful attempt to explore the concepts and verify potential methods, but it is not the end. The project includes only CumInCAD data, while computational design spans more disciplines that remain to be explored.
As a next step, I will try to fetch data for other related conferences from other digital libraries. If possible, one of my goals is to find the distribution of researchers by gender and location, even though these author attributes are difficult to obtain. The representation of the large dataset will remain the most important challenge.