I apologize as I don't know whether this is more of a math question that belongs on mathoverflow or if it's a computer science question that belongs here.

That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?

+4  A: 

The words data, information and knowledge are value-based concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing; in the field of information theory the distinction has no meaning at all, because all three are the same thing: a collection of "information" (in the information-theoretic sense).
Yet they are useful, in context, to summarize the general nature of an information set as loosely explained below.

Information is obtained (or sometimes induced) from data, but it can be richer, as well as cleaner (whereby some values have been corrected) and "simpler" (whereby some irrelevant data has been removed). So in the set theory sense, Information is not a subset of Data, but a separate set [which typically intersects, somewhat, with the data but also can have elements of its own].

Knowledge (sometimes called insight) is yet another level up; it is based on information but it, too, is not a [set theory] subset of information. Indeed, Knowledge typically doesn't refer directly to information elements, but rather tells a "meta story" about the information / data.

The unfounded idea that along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones, probably stems from the fact that there is [typically] a reduction of the volume of [IT sense] information. But qualitatively this info is different, hence no real [set theory] subset relationship.

Example:

  • Raw stock exchange data from Wall Street is ... Data
    A "sea of data"! Someone has a hard time finding what he/she needs, directly, from this data. This data may need to be normalized. For example, the price info may sometimes be expressed in a text string with 1/32 of a dollar precision; in other cases prices may come as a true binary integer with 1/8 of a dollar precision. Also, the fields which indicate, say, the buyer ID or seller ID may include typos, and hence point to the wrong seller/buyer, etc.

  • A spreadsheet made from the above is ... Information
    Various processes were applied to the data:
    -cleaning / correcting various values
    -cross referencing (for example looking up associated codes such as adding a column to display the actual name of the individual/company next to the Buyer ID column)
    -merging when duplicate records pertaining to the same event (but say from different sources) are used to corroborate each other, but are also combined in one single record.
    -aggregating: for example making the sum of all transaction values for a given stock (rather than showing all the individual transactions).
    All this (and then some) turned the data into Information, i.e. a body of [IT sense] Information that is easily usable, where one can quickly find some "data", such as say the Opening and Closing rate for the IBM stock on June 8th 2009.
    Note that while it is more convenient to use, in part more exact/precise, and also boiled down, there is no real [IT sense] information in there which couldn't be located or computed from the original by relatively simple (if painstaking) processes.

  • A financial analyst's report may contain ... knowledge
    For example, the report may indicate [bogus example] that whenever the price of Oil goes past a certain threshold, the value of gold starts declining, but then quickly spikes again, around the time the prices of coffee and tea stabilize. This particular insight constitutes knowledge. This knowledge may have been hidden in the data all along, but only became apparent when one applied some fancy statistical analysis, and/or required the help of a human expert to find or confirm such patterns.
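The data-to-information steps listed in the spreadsheet example (normalizing, cross-referencing, merging, aggregating) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `RAW_TRADES` and `BUYER_NAMES` tables, and the function names are all made up for the example, and the two price formats are the 1/32 text string and 1/8 integer mentioned above.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical raw feed: prices arrive either as a text string in 1/32 of a
# dollar or as an integer count of 1/8 of a dollar.
RAW_TRADES = [
    {"id": "t1", "stock": "IBM", "price": "105 16/32", "qty": 100, "buyer": "B01"},
    {"id": "t1", "stock": "IBM", "price": 844, "qty": 100, "buyer": "B01"},  # same event, second source
    {"id": "t2", "stock": "IBM", "price": 848, "qty": 50, "buyer": "B02"},
]
BUYER_NAMES = {"B01": "Acme Corp", "B02": "Jane Doe"}  # hypothetical lookup table


def normalize_price(raw):
    """Return the price in dollars as an exact Fraction."""
    if isinstance(raw, int):           # integer number of eighths of a dollar
        return Fraction(raw, 8)
    whole, frac = raw.split()          # e.g. "105 16/32"
    return int(whole) + Fraction(frac)


def to_information(raw_trades):
    """Clean, cross-reference, merge and aggregate the raw records."""
    merged = {}
    for trade in raw_trades:
        clean = dict(trade, price=normalize_price(trade["price"]))       # normalize
        clean["buyer_name"] = BUYER_NAMES.get(trade["buyer"], "unknown")  # cross-reference
        merged[trade["id"]] = clean    # duplicates of one event collapse into one record
    totals = defaultdict(Fraction)
    for trade in merged.values():      # aggregate transaction value per stock
        totals[trade["stock"]] += trade["price"] * trade["qty"]
    return merged, dict(totals)


merged, totals = to_information(RAW_TRADES)
```

Nothing in `merged` or `totals` is new in the information-theory sense; everything was computable from the raw records plus the lookup table, which is the point made in the note above.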

By the way, in the Information Theory sense of the word Information, "data", "information" and "knowledge" all contain [IT sense] information.
One could possibly get on the slippery slope of stating that "As we go up the chain the entropy decreases", but that is only loosely true because

  • entropy decrease is not directly or systematically tied to "usefulness for humans"
    (a typical example is that a zipped text file has less entropy yet is no fun reading)
  • there is effectively a loss of information (in addition to entropy loss)
    (for example when data is aggregated, the [IT sense] information about individual records gets lost)
  • there is, particularly in the case of Information -> Knowledge, a change in the level of abstraction
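The loss of information through aggregation can be made concrete with a tiny Python sketch. The transaction lists and the `total_value` helper are hypothetical; the point is only that aggregation is many-to-one, so the individual records cannot be recovered from the total.

```python
def total_value(transactions):
    """Aggregate (price, quantity) pairs into a single total."""
    return sum(price * qty for price, qty in transactions)


# Two different (hypothetical) sets of individual records...
day_a = [(100, 3), (50, 2)]
day_b = [(200, 2)]

# ...produce the same aggregate, so the total alone cannot tell them apart.
same_total = total_value(day_a) == total_value(day_b)
```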

A final point (if I haven't confused everybody yet...) is the idea that the data->info->knowledge chain is effectively relative to the intended use/purpose of the [IT-sense] Information.
ewernli, in a comment below, provides the example of the spell checker: when the focus is on English orthography, the most insightful paper from a Wall Street genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain).
Similarly, a linguist using thousands of newspaper articles which typically (we can hope...) contain at least some insight/knowledge (in the general sense) may just consider these articles raw data, which will help him/her automatically create a French-German lexicon (this will be information); and while working on the project, he/she may discover a systematic semantic shift in the use of common words between the two languages, and hence gather insight into the distinct cultures.

mjv
But the financial report, say a ppt file, will itself be raw data for, say, a spell checker. The spell check will produce information, and the human will interpret it and gain knowledge, e.g. "I always make this or that mistake". It's meta-circular :)
ewernli
@ewernli. Excellent point. I meant to gloss over this, having already written a possibly confusing "manifesto" on the topic, but your comment prompted me to briefly discuss the relativity of purpose. Thank you!
mjv
? Would be nice to see what prompted the -1... I may well have misrepresented something, or possibly be totally wrong on something else; would be nice to know...
mjv
'entropy' and 'spreadsheet' used together? Automatic -1. There are other good reasons for down vote here, but I can only vote once.
ima
Let's be more specific. This answer assumes that information is processed by a human, who first sees data, then understands information and applies knowledge. Such an anthropocentric view is misleading, because the human brain is no different from the said spell checker, as far as fundamental concepts are concerned.
ima
And information processing by a spell checker, a data-mining application, different kinds of machine-learning systems, etc. all follow different patterns, and none of them fits into the progression you describe.
ima
@ima: thank you for this; I think I understand your point of view, now. I can't deny the anthropocentric stance of my expose. I believe that the labels of "data", "information" and "knowledge" which we use to describe the stages of "refinement" of [IT-sense] information are useful and effectively mirror the way humans process the complexity which surrounds them. This hierarchical categorization process, while probably not universal, also applies to other natural systems (even non biologic ones), and to a lesser extent to software...
mjv
(cont.) Also, while fully acknowledging the fuzzy, imprecise and relative nature of the data/information/knowledge "categories" (BTW most classification processes imply such gradual/fuzzy boundaries between categories, and most are also subject to the underlying use), I do not agree with your characterization of these definitions as "shoddy philosophical musings, pretending to have deep meaning yet lacking precise meaning at all." This classification is practical, and is also quite insightful (deep) with regard to the fundamental nature of things.
mjv
And if we talk specifically about human learning and insight: the data-info-knowledge mantra is trivial, crude and incomplete. All the experience we have in education, training, all the advances in biology, in neuroscience - for this? It's an insult to humanity to call this "fundamental".
ima
Implication from false premise is always correct - likewise, drawing generalization from a principle, which doesn't work even for humans, might well seem deep and insightful. But: junk in, junk out. It is either simply a mistake, or a dishonest philosophical twisting.
ima
@ima, I truly enjoy this conversation, and I'm _genuinely_ interested in these matters. No matter how misinformed my premises or how flawed my reasoning may be, I'm intellectually and honestly vested in the argument; be assured that I'm not trying to twist things around for the benefit of the debate team ;-) So, if you are so inclined, let's continue the chat! I'm afraid this is straying from the clean, objective advice on programming matters for which SO was invented... And that's probably why the comment feature of SO wasn't made too comfortable ;-)
mjv
`"a principle which doesn't even work for humans"`... Maybe our disagreement stems from the _scope_ of my use of the data-info-knowledge model. This [principle] works -quite well!- for humans, but, yeah, probably falls short of explaining the whole human experience with regard to knowledge, intelligence, epistemology... The definitions of the concepts data/info/knowledge I provided, and my placing them on a hierarchical scale, illustrated with a human-centric example, were never meant to imply a reductive view of humanity's journey ;-) Yet, there's strong evidence that this type of ...
mjv
mjv
(end) "measured" by the reduction of "irrelevant" information and the introduction of information/messages which sometimes convey an abstraction of the original information.
mjv
Or shorter: "processing information includes analysis and inference". Hard to argue, but...
ima
Got me! I was full of it all along! :-( I'd rather think that I merely confused a "universal" observation (or in your concise version, a circular definition) with a truly insightful bit, but that's because I got to live with myself ;-)
mjv
I edited the answer to stress, from the onset, the subjective meanings of data/info/knowledge. With such clarification, I like to think the answer is useful and addresses the OP's question.
mjv
+1  A: 

Define information and data first, very carefully.

What is information and what is data is very dependent on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the ISP it's just data to be passed on.

Sometimes just adding the right context changes data to information.

So, to answer your question: No, information is not a subset of data. It could be at least one of the following.

  1. A superset, when you add context

  2. A subset, needle-in-a-haystack issue

  3. A function of the data, e.g. in a digest

There are probably more situations.
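A rough Python sketch of the three cases listed above, using made-up sensor readings. The variable names, the context fields and the choice of SHA-256 as the "digest" are assumptions for illustration only.

```python
import hashlib

data = [17, 21, 19, 40]  # hypothetical raw readings, meaningless on their own

# 1. Superset: the same values plus context that was not in the data.
with_context = {"unit": "degrees Celsius", "sensor": "greenhouse-1", "readings": data}

# 2. Subset: the one value someone actually cares about (needle in a haystack).
outliers = [value for value in data if value > 30]

# 3. Function of the data: a fixed-size digest computed from all of it.
digest = hashlib.sha256(repr(data).encode("utf-8")).hexdigest()
```

In each case the "information" stands in a different relation to the original data, which is why no single subset/superset statement covers all of them.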

John Smith
I would go even further and call data-information-knowledge classification a shoddy philosophy, pretending to have deep meaning yet lacking precise meaning at all. Most often it is used to conceal lack of real understanding of information processing and decision making.
ima
@ima John Smith makes an excellent point regarding the importance of context in qualifying data vs. info vs. insight. Yet, however relative (to the context) these three concepts (let's even call them categories, since you use the word classification) may be, they are very useful and real, and don't merely serve to help people fake some genuine understanding of information processing...
mjv
Your answer is exactly the kind of shoddy philosophy I had in mind. Thanks for providing an example.
ima
A: 

Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.

Dan Bryant
+1  A: 

This is how I see it...

Data is dirty and raw. You'll probably have too much of it.

... Jason ... 27 ... Denton ...

Information is the data you need, organised and meaningful.

Jason.age=27
Jason.city=Denton
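As a minimal sketch of that step in Python: the raw tokens only become `Jason.age` and `Jason.city` once something outside the data says which token means what. The `Person` class and the assumed field order are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
    city: str


raw = ["Jason", "27", "Denton"]  # tokens with no stated meaning

# The meaning comes from outside the data: here, an assumed
# field order of (name, age, city).
jason = Person(name=raw[0], age=int(raw[1]), city=raw[2])
```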

Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.

kiwicptn
So no, information is NOT a subset of data.
kiwicptn
+1  A: 

information is an enhancement of data:

  • data is inert
  • information is actionable

note that information without data is merely an opinion ;-)

Steven A. Lowe