Here is the link to my class presentation on KDD & IR - Download file
I plan to write my Master's report this fall. The topic will be the connection between, or integration of, Information Retrieval (IR) technology and information policy.
But I haven't decided on a specific topic yet. Do you have any ideas?
"Information retrieval(IR) is very closely related to information filtering(IF) in that they both have the goal of retrieving information relevant to what a user wants, while minimizing the amount of irrelevant information retrieved" (Foltz & Dumais, 1992, p.52)
However, there are three primary differences between IR and IF.
First, user preferences in information filtering represent long-term interests, while queries in IR tend to represent a short-term interest that can be satisfied by the retrieval.
Second, information filtering is applied to streams of incoming data, while in IR, changes to the database do not occur often, and retrieval is not limited to the new items in the database.
Third, IF involves the process of removing information from a stream, while IR involves the process of finding information in that stream (Foltz & Dumais).
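To make the contrast concrete for myself, here is a minimal Python sketch. Everything in it is an assumption for illustration and not taken from Foltz & Dumais: the toy documents, the hypothetical `retrieve` and `filter_stream` names, and the crude word-overlap matching.

```python
def terms(text):
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().split())


def retrieve(query, collection):
    """IR: a one-off query ranks items in a mostly static collection."""
    q = terms(query)
    scored = [(len(q & terms(doc)), doc) for doc in collection]
    # Keep only matching documents, best overlap first.
    return [doc for score, doc in sorted(scored, key=lambda p: -p[0]) if score > 0]


def filter_stream(profile, stream):
    """IF: a long-term profile is applied to each newly arriving item."""
    p = terms(profile)
    for item in stream:
        # Items that do not match the profile are removed from the stream.
        if p & terms(item):
            yield item


# Hypothetical static collection (IR) and incoming stream (IF).
collection = [
    "data mining tutorial",
    "cooking with garlic",
    "mining clickstream data",
]
incoming = iter([
    "new privacy policy announced",
    "garlic prices rise",
    "privacy and trust in portals",
])

hits = retrieve("data mining", collection)
kept = list(filter_stream("privacy trust", incoming))
```

The query is used once and thrown away, while the profile stays around and is checked against every new item as it arrives, which is the first two differences in miniature.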
In trying to understand how data is discovered in and retrieved from databases, I've been trying to learn more about how different kinds of databases are used and what they are useful for. This is the best short summary I've found on the subject, written in 2001: Object-Relational DBMSs - The Road Ahead
A take on TIA that bucks the opinion trend and focuses on the KDD and IR aspects of TIA. Written by Howard Bloom, author of Global Brain: The Evolution of Mass Mind. Wired 11.04: VIEW
Quote -- "One of TIA's component programs, Genisys, aims to totally reinvent the database, increasing its usefulness and its contents by an order of magnitude. It will be the database of databases, with an add-on "Babblefish" able to parse and cross-reference every possible information stream. The most ambitious TIA initiative, Genoa II, is working to produce cognitive amplifiers - a symbiotic thinking system that weaves together human and machine intelligences more tightly than ever before."
Privacy and trust are major issues in promoting the functions of corporate portals: gathering, sharing, and disseminating information. Those issues are also related to all topics of KMS.
This article provides "new non-third party mechanisms to overcome" privacy and trust barriers, and also solutions for "finding shared preferences, discovering communities with shared values, removing disincentives posed by liabilities, and negotiating on behalf of a group", and techniques "to enable these new capabilities".
I found more specific information about the Knowledge Pump system, which we learned about in the "collaborative filtering" class.
The Knowledge Pump can foster an environment that encourages the flow, use, and creation of knowledge by supporting social networks and electronic repositories.
My analysis of the Open Directory Project (ODP) is posted in .pdf form at http://www.ischool.utexas.edu/~khaack/TKMS/ODP.pdf
ODP is the world's largest human (volunteer)-edited Web directory. It can be used as a search engine but allows the user much more freedom in choosing their direction of searching than search engines that simply return ranked results.
Here is the ODP website (which is often slammed) http://dmoz.org
For those who would like more info about how XML works in/with databases-- XML and Databases //Quote//This paper gives a high-level overview of how to use XML with databases. It describes how the differences between data-centric and document-centric documents affect their usage with databases, how XML is commonly used with relational databases, and what native XML databases are and when to use them.//Endquote//
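As a rough illustration of the data-centric case, here is a small Python sketch that copies a regularly structured XML document into a relational table. The `ORDERS_XML` sample, the `orders` table, and the `load_orders` helper are all made up for this example; it is one simple mapping under those assumptions, not the approach the paper prescribes.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-centric XML: regular structure, fine-grained fields.
ORDERS_XML = """
<orders>
  <order id="1"><customer>Ann</customer><total>19.50</total></order>
  <order id="2"><customer>Bob</customer><total>5.00</total></order>
</orders>
"""


def load_orders(xml_text, conn):
    """Map each <order> element to one row of a relational table."""
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    root = ET.fromstring(xml_text)
    for order in root.findall("order"):
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (
                int(order.get("id")),
                order.findtext("customer"),
                float(order.findtext("total")),
            ),
        )
    conn.commit()


# An in-memory SQLite database stands in for a real relational DBMS.
conn = sqlite3.connect(":memory:")
load_orders(ORDERS_XML, conn)
rows = conn.execute("SELECT customer, total FROM orders ORDER BY id").fetchall()
```

A document-centric file (say, a marked-up essay with mixed text and tags) would not flatten into rows this neatly, which is where the paper's discussion of native XML databases comes in.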
Uncovering Information Hidden in Web Archives. A view of KDD and IR from an archival perspective. A good primer on how data warehousing works from D-Lib Magazine.
Many classmates have understandable skepticism about blogs as a KM tool, but here's an interesting example of how blogs can be used to share tacit knowledge. A few months back, I posted a little story about a neighbor and her difficulties using Zote, a Mexican laundry detergent. Apparently someone from Zote found my site through Google, and posted a comment explaining Zote. I thanked him via email, and he responded with a more detailed discussion of Zote, which I've posted here. I never would have imagined someone would have found my question and answered it, but it certainly worked out.
I ran across this recent interview with James Michalko, the director of RLG, about Information Access on the Wide Open Web
He hits a lot of issues related to the difficulties of searching on the Web. His comment on the availability of scholarly resources on the Web: "Some firms, such as Amazon, have created algorithms and done the computational analysis that asks, 'Did you really mean this?' and says 'If this is what you want, then you will find the following things relevant.' We must deliver authoritative trusted information using those kinds of paradigms or we will simply become museums of long-term storage instead of current use. These are some of the ways in which the CS community could make accessible on behalf of the broad Internet community enormous amounts of wonderful resources that right now are either inaccessible or severely under used."
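The "Did you really mean this?" idea can be sketched in a few lines of Python. This is only a toy stand-in built on the standard library's `difflib`, with a hypothetical `CATALOG` list and `did_you_mean` helper; it is not how Amazon or any real service does it.

```python
import difflib

# Hypothetical catalog of known titles; a real system would use far more data.
CATALOG = [
    "information retrieval",
    "information filtering",
    "data mining",
    "knowledge discovery",
]


def did_you_mean(query, catalog=CATALOG):
    """Return the closest known title to a possibly misspelled query, or None."""
    matches = difflib.get_close_matches(query.lower(), catalog, n=1, cutoff=0.6)
    return matches[0] if matches else None


suggestion = did_you_mean("informaton retreival")
```

Here the suggestion comes from plain string similarity; the kind of computational analysis Michalko describes would presumably also draw on what other users searched for and selected.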
I had an assignment about information-seeking behavior in my first semester in the U.S. The assignment required interviewing someone with a pre-formatted, open-ended questionnaire, so it could not include closed questions such as yes-or-no questions.
While I was interviewing my interviewee, who was seeing a doctor regularly because of an eye problem, I realized that the study of information-seeking behavior is fundamental to our field, library and information science.
I found a good resource about it.
Solomon, P. (1997) "Conversation in Information-Seeking Contexts: A Test of an Analytical Framework." Library & Information Science Research 19 no. 3, 217-248.
Also available on the WWW. URL:
http://ils.unc.edu/~solomon/hp/ConInfo.html
Abstract
"This article develops an analytical framework to support the analysis of conversations in information seeking contexts. The framework brings together linguistic and sociolinguistic issues such as vocabulary, cohesion, coherence, turn taking, turn allocation, overlaps, gaps, openings, closings, frames, repairs, role specification, and stylistic features. These issues serve as viewpoints for exploring how information-seeking conversations differ from casual conversation and conversations in restricted conversational domains (e.g., teacher-student; physician-patient). A sample of nine conversations from two information seeking contexts (i.e., school library media center, public library) is used to test the utility of the analytical framework and explore possible characteristics of information seeking conversations. The findings support the utility of the framework for various purposes including: training of information specialists, feedback on their performance, design of human-computer dialogues, elicitation of decision making processes during information seeking, and support for natural language processing."
Listed below are some of the resources I have run across while trying to educate myself on the basics of KDD/IR (especially concerning description/discovery of Web resources). Some of the tutorials might prove helpful when tackling the assigned class readings on KDD/IR (eons from now).
KDD Glossaries
Machine Learning Glossary of Terms
Special Issue on Applications of Machine Learning and the Knowledge Discovery Process
http://robotics.stanford.edu/~ronnyk/glossary.html
Machine Discovery Terminology
compiled by W. Kloesgen and J. Zytkow
http://orgwis.gmd.de/projects/explora/terms.html
Data Mining Glossary from Two Crows -
http://www.twocrows.com/glossary.htm
Datawarehouse Terminology
by Creative Data:
http://www.credata.com/research/terminology.html
Introductory material:
KDD -
Knowledge Discovery In Databases: Tools and Techniques
by Peggy Wright
http://www.acm.org/crossroads/xrds5-2/kdd.html?ROLES=0PSA0STA0EMA&DOMAIN=.acm.org
Data Mining -
Introduction to Data Mining and Knowledge Discovery. 3rd Ed. Published by Two Crows Corporation
http://www.twocrows.com/intro-dm.pdf
Web Resources IR -
Practical Issues for Automated Categorization of Web Sites
by John M. Pierre, Metacode Technologies, Inc.
September 2000
http://www.ics.forth.gr/isl/SemWeb/proceedings/session3-3/html_version/semanticweb.html
Info on DAML+OIL from daml.org:
Tutorials on DAML+OIL from xml.com:
http://www.xml.com/pub/a/2002/01/30/daml1.html
http://www.xml.com/pub/a/2002/03/13/daml.html
Basic basics on Ontology Inference Layer (OIL):
http://www.ontoknowledge.org/oil/
And, of course, more on Web Ontology Language (OWL):
http://www.w3.org/TR/2002/WD-owl-guide-20021104/#Abstract
Current work on Web Resource representation and IR:
For an overview of clickstream analysis of Web activity:
INFORMATIONWEEK.com News, March 12, 2001
Pan For Gold In The Clickstream
http://www.informationweek.com/828/prmining.htm
Using Topic Maps for Web Resources description and IR:
http://www.xml.com/pub/a/2002/09/11/topicmaps.html?page=1
Project Aristotle(sm): Automated Categorization of Web Resources, is a clearinghouse of projects, research, products and services that are investigating or which demonstrate the automated categorization, classification or organization of Web resources. A working bibliography of key and significant reports, papers and articles, is also provided. Projects and associated publications have been arranged by the name of the university, corporation, or other organization with which the principal investigator of a project is affiliated.
http://www.public.iastate.edu/~CYBERSTACKS/Aristotle.htm
An online textbook for those who REALLY want to get into the nitty gritty of Information Retrieval:
INFORMATION RETRIEVAL, 2nd Ed (1999). by C.J. van Rijsbergen
Department of Computing Science, University of Glasgow:
http://www.dcs.gla.ac.uk/~iain/keith/
Slashdot pointed me to this fascinating piece by James Grimmelmann on LawMeme (which I've never followed before but will watch in the future):
Accidental Privacy Spills: Musings on Privacy, Democracy, and the Internet
The article discusses the spread of a personal email by Laurie Garrett, a journalist attending the January World Economic Forum in Davos. In particular, it addresses Garrett's (justifiable?) anger at learning that her "private" email had been forwarded without her permission by someone in her circle of trust and had subsequently been discussed by "techno-liberalists" on lists such as MetaFilter.
All in all, I think that this piece ties together a lot of the themes we've discussed so far in class.
Yesterday there was an interesting day of "Show and Tell" that was co-hosted by the College of Communication / School of Information / Department of Electrical Engineering / Department of Computer Science. I am not sure why the School of Information didn't make this day more apparent to its students, but I will be keeping my eyes and ears open for future forums of this sort.
Speakers (including our own Turnbull, Bias, and Chen) spoke on topics ranging from "Autonomous Learning Agents in Dynamic, Multiagent Environments" to "Educational Content in Video Games" to "Data Mining for Informational Retrieval" ...
What was great about this forum is that innovative, dynamic research is being done across campus and it appears that professors and students are recognizing the importance of sharing knowledge throughout the university community.
The talk by Joydeep Ghosh (his website), who discussed his research on "Data Mining for Info. Retrieval", was particularly timely for our reading of "Lifestreams ..." because he is researching the idea that "Personal data should be accessible anywhere and compatibility should be automatic" (81). A great deal of the advancement of personal information management seems to depend on creating intelligent agents that will be able to "interact" with users by differentiating the levels of importance that users attach to the information they encounter.
--- Is a website important because the user accesses it frequently, or do they access it frequently simply because it is a default? What banking information is really necessary ... can certain things be hidden, like investment notations, and just appear as calendar reminders? ----
< short rant >
Presently, I stay away from a lot of PIM programs and devices because I have to go through so many steps to program in my preferences, schedules, etc. What I would use is something that might not "be smarter" than me but that would give me confidence in that it appears to be ... maybe that is a naive statement, but I think that if I purchase a device to be my "assistant" then it should be as assistant-like as possible. After all, if a real-life assistant couldn't keep all of a boss' comings and goings straight they'd probably be fired!
< / rant >
To conclude, there appears to be some truly innovative research happening (and/or on the brink of happening) at UT in the fields of knowledge management, data mining, and general information goodness. As a graduate student it is comforting to learn about the various groups on campus coming together because it means opportunities are ripe for future research.
I've been doing some background reading on knowledge discovery and information retrieval and I find myself lost in the mists of time. The importance of writing to human memory is something I had thought about before, of course, but I had not thought of it in the context of early pictographic writing and how it was used to record data and establish collective knowledge bases. The idea is that an individual's knowledge was no longer bounded by what he/she could remember.