2024 Fall ISA 5810: Data Mining: Concepts, Techniques, and Applications

Syllabus

Orientation

9/2 for 3 hours

During the orientation session, you'll have the opportunity to acquaint yourself with the course structure, meet your instructor, and connect with fellow classmates, fostering a collaborative and engaging learning environment. Additionally, we will provide a comprehensive overview of the course content, setting the stage for a productive and enlightening educational journey.

Activities

Reading: Syllabus
Join Teams
YouTube
For those unable to attend the initial session, kindly review the recordings available on NTU Cool or Teams and take the Orientation Quiz.
The one-minute reflection summary can be found here.

Interesting Videos

How We Used Data to Win the Presidential Election by Dan Siroker

Overview and Data

9/9, 9/16 for 6 hours

Mastering and optimizing data stands as a pivotal phase in the comprehensive process of data mining activities. In this session, an introduction to the diverse attributes and distinct characteristics inherent in datasets will take center stage. This will transition into a deep dive into various data preprocessing techniques essential for effective data analysis.

Following this, a range of similarity and distance measures will be explored, serving as vital tools for discerning patterns and trends within the data. To conclude the session, an immersion into the art of data visualization will take place, showcasing a potent tool that aids in the intuitive representation and interpretation of complex data structures.
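As a small illustration of the similarity and distance measures mentioned above, the following sketch computes Euclidean distance, cosine similarity, and Jaccard similarity with the Python standard library only. The function names and toy inputs are illustrative assumptions, not course material.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(s, t):
    """Share of distinct items that two collections have in common."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(s & t) / len(s | t)


print(euclidean((0, 0), (3, 4)))                      # 5.0
print(cosine_similarity((1, 0), (0, 1)))              # 0.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))      # 0.5
```

Euclidean distance suits numeric attributes on comparable scales, cosine similarity ignores vector length and is common for text, and Jaccard similarity applies to set-valued or binary attributes.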

Activities

Join NTU Cool. For students from Chaoyang University, National Yang Ming Chiao Tung University, National Tsing Hua University, National Cheng Kung University, Southern Taiwan University of Science and Technology, Tatung University, and National Taiwan Normal University who enrolled in the class before September 3rd, NTU Cool has already sent out the invitation emails. If you haven't received one, please check the email address you provided through your school, as the invitation should be there.

Lab for Data Exploration and Management

9/23

During this lab session, emphasis will be placed on using scientific computing libraries to process, transform, and manage data. Participants will also be acquainted with common practices and introduced to current visualization tools for effective big data analysis.

Activities

Class is offered on YouTube only
Assignment One should be submitted before Oct 27

Classification

9/30, 10/7 for 6 hours

Classification, a core form of supervised learning, stands as a focal point in the spheres of data mining and machine learning. The primary objective is to assign input data to predefined classes based on labeled training examples.

In this session, crucial algorithms integral to classification techniques will be explored. The discussion will commence with an analysis of Decision Trees, utilizing a tree-like graph structure for strategic decision-making. This will transition into a study of Bayesian Networks, central tools for deducing probabilities and making informed predictions by analyzing the statistical relationships between different variables. Subsequently, the focus will shift to Neural Networks, potent frameworks adept at deciphering complex patterns and facilitating precise predictions. The session will conclude with an overview of Convolutional Neural Networks (CNNs), vital instruments in the realm of visual imagery analysis, notably in tasks involving image and video recognition.
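To make the decision tree idea concrete, the sketch below computes entropy and information gain, one common criterion for choosing which attribute to split on. It is a minimal illustration under assumptions: the toy weather rows and the function names are hypothetical, and other split criteria (such as the Gini index) are equally valid.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]

for index, name in enumerate(["outlook", "temperature"]):
    print(name, information_gain(rows, labels, index))
```

In this toy data the outlook attribute separates the classes perfectly while temperature tells us nothing, so a tree builder using this criterion would split on outlook first.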

This session aims to impart a comprehensive understanding of the core principles and subtleties of classification, furnishing participants with the skills vital for success in data mining projects.

Activities

Reading: Pedro Domingos, A few useful things to know about machine learning, Communications of the ACM, Volume 55, Issue 10, October 2012, pp 78–87

Text Mining

10/14, 10/21 for 6 hours

Text mining operates as a method for gleaning essential insights from unstructured textual data, commonly employing Natural Language Processing (NLP) techniques such as lexical and syntactic analysis, and inference methods.

In this session, advanced computational methodologies like the Word2Vec algorithm will be discussed, highlighting its role in mapping word relationships through vector spaces. The conversation will also introduce Transformers, which enable efficient sequence processing, and Large Language Models, renowned for their expansive text generation and comprehension capabilities. A segment on ChatGPT will illustrate its significance in modern applications such as chatbots and content creation, underscoring the current innovations in the text mining domain.
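The idea of mapping word relationships through vector spaces can be sketched with a toy analogy query. The two-dimensional vectors below are hand-made assumptions chosen for illustration; real Word2Vec embeddings are learned from a corpus and typically have hundreds of dimensions.

```python
import math

# Hypothetical 2-D "embeddings": (royalty, gender). Not learned from data.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", vectors))  # queen
```

The query works because the offset between "king" and "man" matches the offset between "queen" and "woman" in this toy space, which is the kind of regularity learned embeddings are often reported to exhibit.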

Activities

Lab for Deep Information Retrieval and Neural Word Embeddings

10/28 for 3 hours

During this lab session, hands-on practice will take center stage, guiding participants through the use of information retrieval techniques for the modeling, training, and classification of textual data. The session will offer practical exposure to neural embedding models such as word2vec, doc2vec, and FastText. Furthermore, participants will have the opportunity to work with traditional text classification approaches like KNN, SVM, and Naive Bayes, enabling a comprehensive, practice-oriented understanding of the diverse techniques used in the field.
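As a taste of the traditional approaches listed above, here is a minimal multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written with the standard library only. The training sentences, labels, and function names are hypothetical; the lab itself may use library implementations instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the class with the highest log-posterior for the document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
docs = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(docs, labels)
print(predict(model, "great wonderful movie"))
print(predict(model, "awful boring"))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents grow beyond a few words.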

Activities

Class is offered on YouTube only
Assignment Two should be submitted before Nov 26

DM Clustering & Project Progress Report

11/4, 11/11 for 6 hours

Cluster analysis serves as a technique to group objects such that those within the same cluster exhibit higher similarity to each other compared to those housed in separate clusters. Initially embraced within the realms of pattern recognition and signal processing, these clustering strategies have expanded their influence into many other domains. This session will present a deep dive into a range of clustering techniques, emphasizing key algorithms such as K-Means for partitioning, Hierarchical Clustering which forms a tree of clusters, Density-Based Clustering that groups together points with sufficient proximity, and aspects of Cluster Validity which assesses the quality and reliability of the clusters formed. This discussion aims to furnish attendees with a robust understanding of these pivotal clustering algorithms and their practical applications.
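The partitioning idea behind K-Means can be sketched in a few lines: assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. The version below is a simplified illustration with fixed starting centroids; the sample points and function name are assumptions, and practical implementations add smarter initialization and convergence checks.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic K-Means (Lloyd's algorithm) from caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data: two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Because the result depends on the starting centroids, K-Means is usually run several times with different initializations, and cluster validity measures are used to compare the outcomes.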

Activities

11/4 Project Progress Report

Association Rules

11/18, 11/25 for 3 hours

Association rule learning identifies meaningful relationships between variables in large datasets, using interestingness measures such as support and confidence to pinpoint strong rules. This session will provide a succinct introduction to the core concepts of association rules, along with an overview of the Frequent Pattern Growth algorithm and key techniques for Pattern Evaluation. Participants will be equipped with the knowledge to effectively apply these techniques in real-world scenarios.
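The measures used to evaluate a rule can be computed directly from a list of transactions. The sketch below covers support, confidence, and lift for a single rule; it does not implement Frequent Pattern Growth, and the market-basket transactions and function names are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

print(support(transactions, {"diapers", "beer"}))
print(confidence(transactions, {"diapers"}, {"beer"}))
print(lift(transactions, {"diapers"}, {"beer"}))
```

A lift above 1 suggests the antecedent and consequent appear together more often than independence would predict, which is one reason confidence alone is not considered enough for pattern evaluation.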

Activities

Classes are also offered on Teams
Reading: J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
Reading: J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, Springer, 2004
Reading: N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, pp. 398-416, Jerusalem, Israel, Jan. 1999
Reading: R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96

Activities

11/18 Project Progress Report

Examination

12/2 for 3 hours

Time to evaluate. Unlike other examinations in our lives, this one is not meant to assess how much we remember. It is more important to know how much we understand. Hence, each student can bring one A4 page with any kind of notes into the classroom. Enjoy.

Notes

Students can take one A4 page with them
The locations will be announced by email

Student Presentation & Discussion

12/9 for 3 hours

Participants will engage in a collaborative exploration of a specified paper using the Jigsaw reading approach. Each student will be entrusted with understanding a particular section of the paper in depth, with the goal to elucidate their findings to group members. This initiative encourages not only a profound individual comprehension of the material but also fosters a synergistic learning environment, where aiding group members in grasping complex concepts becomes paramount. It’s a step towards nurturing a learning community where knowledge is mutually shared and amplified through collaborative discussion.

Activities

TAs will give the assignment.

Final Project Demo

12/16 for 3 hours

Culminating in a display of knowledge acquired through learning, analysis, and execution, this final project demonstration stands as a testament to your grasp of data mining principles throughout this course. Through this project, participants can also gain valuable experience in collaborative teamwork.

Requirements

Each group should produce a 4-minute YouTube clip to show in class
The final project requirements will be sent by email

2024 ISA 5810

About ISA5810, Fall 2024

Text Book

Time in 2024

Location:

Instructor:

Yi-Shin Chen

Teaching Assistants:

Didier Salazar

Kuan-Hao Yeh

Po-Yung (Joe) Huang

Gerraldo Candra

Retnani Latifah

Meng-Chieh Tang

Hao-Ze (Arthur) Wang
