Last modified on 2018-05-14 12:02:40

TIES445 Data Mining, spring 2018

Time: March 12th - May 23rd

Lecturers: Tommi Kärkkäinen, Mirka Saarela, Joonas Hämäläinen, Sami Äyrämö

Exercises: Joonas Hämäläinen

Information in Korppi

Materials in Koppa

Live feed

Main contents and further reading

Main Course material

On Data Mining project

  • Done in groups of 2-3 students (under special circumstances also individually, e.g. by a purely distance-learning student)
  • Amount of work roughly 1-2 ECTS (30-50 hours of work)
  • Data + its domain + some DM algorithm(s) + interpretation + presentation to others
  • Presentations in the seminar on Wed, May 16, 14:00-20:00
  • Orientation can be technical
    • Distributed DM on Hadoop for big data
    • Introduce a DM tool to others, do something with it, and compare it to the M-stuff in the exercises
  • or applied
    • Apply standard techniques to your own data set
    • Consult the possible application domains as listed in Lecture 1, e.g.
      • telecom traffic/protocol as original data source
      • software code or software project deliverables as original data source
      • audios, or images, or videos as original data source
  • or scientific
    • Take one or a few recent scientific publications on an interesting/related topic, review them, and do some testing
  • or something else: think, make a suggestion and let's discuss further
  • Start thinking about groups and theme + orientation now!
  • Inform me (the lecturer) about groups and topic (after discussions, e.g. after lectures or during exercises)
    • One member of the group can add project information below after logging in here

Active DM projects

  1. Milla Koivuniemi and Kimmo Riihiaho: UPSP dataset + classification
  2. Antti Kariluoto and Petri Vähäkainu: Analyzing KIRA data
  3. Pinja Pesonen, Vilja Koski and Veera Tiainen: Educational data clustering
  4. Päivi Nummelin and Kasimir Ilmonen:
  5. Sami Kyyhkynen and Merja Halonen: Global competitiveness index data clustering
  6. Samu Kumpulainen and Marko Raatikainen: Social media & movies clustering
  7. Arttu Ylä-Sahra: Analysis of patterns and statistical findings of an SSH log
  8. Ossi Jormakka: Analyzing software bug reports
  9. Lauri Kantola, Matti Viljamaa and Jarno Kiesiläinen: Dorothea dataset binary classification
  10. Janne Mäyrä and Riikka Vilavaara: Predicting the success of live Kickstarter projects
  11. Jesse Kananen: Estimating runner's contact time with accelerometer sensor

Some topics of DM projects during earlier semesters

  • Classification of a given dataset - utilizing and/or comparing classifiers (UCI datasets)
  • Human Activity Recognition (UCI dataset)
  • Knowledge Discovery on Pop Music Makers
  • Network Traffic Analysis
  • Fisher's linear discriminant (own generated data)
  • Analysis of Finnish-language content mined from old books (via Project Gutenberg)
  • Knowledge Discovery on Computer Games
  • Knowledge Discovery from Enterprise Data
  • Knowledge Discovery from Temporal Network Data
  • Knowledge Discovery from Job Announcements
  • Optical character recognition
  • Sentiment analysis
  • Dimension reduction for cybersecurity
  • scikit-learn
  • Analysis of Tanzanian water pumps

Open Data Repositories

Note: Many of these repositories are redundant, i.e. they contain data sets that are obtained from or linked to other similar repositories.

Note 2: I tried to avoid data set repositories that need special software for reading the data (though Matlab's .mat format is an exception to this).

Some Matlab toolboxes

A few links to own research activities, which can be referred to during the lectures to exemplify the concepts

Support for exercises