Database Systems

Abraham Silberschatz, ... S. Sudarshan, in Encyclopedia of Information Systems, 2003

C Integrity and Security

The data values stored in the database must satisfy certain types of integrity constraints (also known as consistency constraints). For example, the balance of a bank account may never fall below a prescribed amount (say, $25).

Database systems allow such constraints to be specified explicitly, and reject any update to the database that would cause a constraint to be violated. If a file-processing system were used instead of a database, such constraints would have to be enforced by adding appropriate code to the various application programs. However, when new constraints are added, it is difficult to change the programs to enforce them. The problem is compounded when constraints involve several data items from different files.
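
As a minimal sketch of how such a constraint can be declared, the SQL below enforces the minimum-balance rule from the example above; the account table and its columns are assumed here purely for illustration.

-- Hypothetical account table; the CHECK constraint states the rule once,
-- and the DBMS rejects any insert or update that would violate it.
CREATE TABLE account (
    account_number CHAR(10) PRIMARY KEY,
    branch_name    VARCHAR(50) NOT NULL,
    balance        NUMERIC(12, 2) NOT NULL,
    CONSTRAINT min_balance CHECK (balance >= 25.00)
);

-- This update is rejected by the database if it would drop the balance below $25:
UPDATE account SET balance = balance - 100.00 WHERE account_number = 'A-101';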

Not every user of the database system should be able to access all the data. For example, in a banking system, payroll personnel need to see only that part of the database that has information about the various bank employees. They do not need access to information about customer accounts.

Database systems allow explicit specification of which users are allowed what sort of access (such as read or update) to what data. If a file-processing system were used, application programs would require extra code to verify whether an attempted access is authorized. But, since application programs are added to the system in an ad hoc manner, enforcing such security constraints is difficult.
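
As a hedged illustration of such explicit authorization, standard SQL privileges can express the banking example above; the role, table, and column names are assumptions, not taken from the chapter.

-- Payroll personnel may read employee data and update salaries,
-- but receive no privileges at all on the customer account table.
CREATE ROLE payroll_clerk;
GRANT SELECT ON employee TO payroll_clerk;
GRANT UPDATE (salary) ON employee TO payroll_clerk;

-- Any attempt by this role to read customer accounts is rejected by the
-- DBMS itself, with no authorization checks coded into application programs.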

URL: https://www.sciencedirect.com/science/article/pii/B0122272404000289

Introduction

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

1.5.3 Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users. Particularly, database systems researchers have established highly recognized principles in data models, query languages, query processing and optimization methods, data storage, and indexing and accessing methods. Database systems are often well known for their high scalability in processing very large, relatively structured data sets.

Many data mining tasks need to handle large data sets or even real-time, fast streaming data. Therefore, data mining can make good use of scalable database technologies to achieve high efficiency and scalability on large data sets. Moreover, data mining tasks can be used to extend the capability of existing database systems to satisfy advanced users' sophisticated data analysis requirements.

Recent database systems have built systematic data analysis capabilities on database data using data warehousing and data mining facilities. A data warehouse integrates data originating from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining (see Section 1.3.2).
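
As a brief, hypothetical illustration of the data cube idea, many SQL dialects can compute all aggregate combinations of a set of dimensions in one statement; the sales table and its columns below are assumptions, not examples from the book.

-- Aggregates sales over every combination of the three dimensions
-- (item, location, year), i.e., all 2^3 group-bys of a small data cube.
SELECT item, location, year, SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE (item, location, year);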

URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000010

Parallel Computing

Andreas Reuter, in Advances in Parallel Computing, 1998

1 Introduction

Database systems have exploited parallelism from the very beginning (in the early 60s), in a way similar to operating systems. An operating system allows many users to use the same hardware concurrently, i.e. in parallel, and a database system allows the same database to be concurrently accessed by many application programs. There is an important difference, though: Whereas the operating system gives each user its own private resources (processes, address spaces, files, sessions, etc.), the database is a shared object. Different applications can access the same records or fields concurrently, and the problem of automatically maintaining correctness in such a parallel execution environment was one of the great challenges in database research in the 70s. The problem was solved by the transaction concept with its sophisticated interplay of consistency checking, locking and recovery, and the resulting functionality proved an ideal basis for large distributed online information systems.

For a long time, this was the only use of parallelism in databases, with the exception of a few special systems. However, in the 80s users increasingly started developing applications that were radically different from classical online transactions: They ploughed through their databases (which kept growing rapidly) in order to identify interesting patterns, trends, customer preferences, etc. - the kind of analysis that is called “decision support”, “data mining”, “online analytic processing” or whatever.

Those applications use databases in a different way than online transactions. An online transaction is short; it typically executes less than one million instructions, touches less than 100 objects in the database, and generates less than 10 messages. However, there are many concurrent transactions of the same kind, each transaction wants to experience short response time, and the system should run at high throughput rates.

Decision support applications, on the other hand, touch millions of objects, execute 10^12 instructions or more, and they typically do not execute in parallel to either each other or normal online transactions. Running such applications with reasonable elapsed times requires the use of parallelism in all parts of the database system, and the explanation of which options are available is the focus of this paper.

URL: https://www.sciencedirect.com/science/article/pii/S0927545298800051

Incorporating Uncertainty into Data Integration

AnHai Doan, ... Zachary Ives, in Principles of Data Integration, 2012

Database systems typically model only certain data. That is, if a tuple is in the database, then it is true in the real world. There are many contexts in which our knowledge of the world is uncertain, and we would like to model that uncertainty in the database and be able to query it. For example, consider situations like the following:

We are collecting information about theft sightings, and a witness does not remember whether the thief had blond or brown hair.

We are recording temperature measurements from sensors in the field, and the accuracy of the sensors has a certain error bound or measurements may even be completely wrong if a sensor's battery is low.

We are integrating data across multiple sites, and the answer to a query requires a join. This join includes an approximate string matching condition, which may yield a certain number of false positives.
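
One common way to capture this kind of uncertainty, used in many probabilistic database prototypes, is to attach a confidence value to each tuple and let queries aggregate over it. The sketch below is an illustration under that assumption; the sighting table and its columns are hypothetical.

-- Each alternative for the witnessed hair color carries a probability.
CREATE TABLE sighting (
    sighting_id  INT,
    suspect_hair VARCHAR(10),
    prob         NUMERIC(3, 2)
);

INSERT INTO sighting VALUES (1, 'blond', 0.6), (1, 'brown', 0.4);

-- Instead of treating every stored tuple as certainly true,
-- a query can rank possible answers by their confidence.
SELECT suspect_hair, SUM(prob) AS confidence
FROM sighting
GROUP BY suspect_hair
ORDER BY confidence DESC;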

URL: https://www.sciencedirect.com/science/article/pii/B9780124160446000132

Database Development Process

Ming Wang, Russell K. Chan, in Encyclopedia of Information Systems, 2003

III.A. History of Database Development

Database systems evolved from file processing systems. As database management systems have become more and more popular, file processing systems are slowly fading from the scene. This is because database management systems are superior to file processing systems in the following ways:

Data dictionary management

Data storage management

Security management

Multi-user access control

Data integrity management

Backup and recovery management

III.A.1. Hierarchical DBMS

In the late 1960s, IBM developed the first commercial hierarchical DBMS, the Information Management System (IMS), and its DL/1 language. The principles behind the hierarchical model are derived from IMS. Hierarchical DBMSs organize their record types in a hierarchical tree. This approach is well suited for large systems containing a lot of data. Hierarchical databases support two types of information: the record type, which is a record containing data, and parent-child relations (PCR), which define a 1:N relationship between one parent record and N child records. Hierarchical DBMSs retrieve data quickly because all the paths that link record types are predefined. Hierarchical databases are still used in many legacy systems, and IMS remains the leading hierarchical DBMS, used by a large number of banks, insurance companies, and hospitals as well as several government agencies.

III.A.2. Network DBMS

The original network model and language were presented in the CODASYL Database Task Group's 1971 report; hence it is also called the DBTG model. Revised reports in 1978 and 1981 incorporated more recent concepts. The network DBMS offers an efficient access path to its data and is capable of representing almost any informational structure containing simple types (e.g., integers, floats, strings, and characters). The network DBMS is more flexible than the hierarchical DBMS because it can represent and navigate among 1:1, 1:N, and M:N relationships. But the programmer still has to know the physical representation of the data to be able to access it. Accordingly, applications using a network database have to be changed every time the structure of the database changes, making database maintenance tedious. Both the hierarchical and network DBMSs were accessed from the programming language (usually COBOL) through a low-level interface.

III.A.3. Relational Databases

In 1970 Edgar F. Codd, at the IBM San Jose Research Laboratory, proposed a new data representation framework called the relational data model which established the fundamentals of the relational database model. Codd suggested that all data in a database could be represented as a tabular structure (tables with columns and rows, which he called relations) and that these relations could be accessed using a high-level nonprocedural (or declarative) language. Instead of writing programs to access data, this approach only needed a predicate that identified the desired records or combination of records. This would lead to higher programmer productivity and in the beginning of the 1980s several relational DBMS (RDBMS) products emerged (e.g., Oracle, Informix, Ingres, and DB2).

The relational approach separates the program from the physical implementation of the database, making the program less sensitive to changes to the physical representation of the data by unifying data and metadata in the database. SQL and RDBMSs have become widely used due to the separation of the physical and logical representation (and of course to marketing). It is much easier to understand tables and records than pointers to records. Most significantly, research showed that a high-level relational query language could give performance comparable to the best record-oriented databases.
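
As a small illustrative sketch of that declarative style, the query below states only a predicate over a hypothetical account table; how the rows are located (index, scan, join order) is left entirely to the DBMS.

-- Declarative access: the predicate identifies the desired records;
-- no navigation code or knowledge of the physical layout is needed.
SELECT customer_name, balance
FROM account
WHERE balance < 100.00;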

URL: https://www.sciencedirect.com/science/article/pii/B0122272404000265

The Odds Are Against You

Dennis Nils Drogseth, ... Dan Twing, in CMDB Systems, 2015

No Resources and High Costs

If the CMDB System seems like an impossible dream, doesn't devoting largely nonexistent IT resources to it sound, at best, problematic? Optimizing resources is one of the reasons we advocate proceeding in phases that bring relatively quick value and help to reinforce CMDB System benefits within three to six months:

“We are currently at 100%. The CMDB will just be competing for resources we don't have.”

“Some people here think you can just ‘pop’ the CMDB in. But I see it as a resource drain.”

URL: https://www.sciencedirect.com/science/article/pii/B9780128012659000019

Data Analysis Techniques

Qamar Shahbaz Ul Haq, in Data Mapping for Data Warehouse Design, 2016

Aggregate Functions

Sometimes we need aggregate functions for data analysis. These functions can be used for verifying individual column values or for comparing source data with another system to confirm aggregated values.

Different database systems have different aggregate functions, but all provide basic ones. Below are the ones that are most commonly used during data analysis.

COUNT: This function counts the number of occurrences of a column (or combination of columns).

SUM: This function adds up all the values in a column.

MIN: This function returns the minimum value in a column.

MAX: This function returns the maximum value in a column.

These functions are usually used with other SQL features to get the desired result set. For example, if you want to see how much sales volume was generated for a certain type of product, then you can use a CASE statement with the SUM function:

SUM(CASE WHEN Product_Type = 1 THEN Amount ELSE 0 END)

COUNT(*) gives a count of all rows in a table; adding DISTINCT on a column gives the count of unique values in that column:

COUNT(DISTINCT employee_id)
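
Putting these pieces together, a typical analysis query might combine the conditional SUM with a GROUP BY; the sales table and the region column below are assumptions for illustration, while Product_Type, Amount, and employee_id are the columns used above.

-- Sales volume per region, split by product type, plus the number of
-- distinct employees who sold anything, computed in a single pass.
SELECT region,
       SUM(CASE WHEN Product_Type = 1 THEN Amount ELSE 0 END) AS type1_sales,
       SUM(CASE WHEN Product_Type = 2 THEN Amount ELSE 0 END) AS type2_sales,
       COUNT(DISTINCT employee_id) AS selling_employees
FROM sales
GROUP BY region;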

URL: https://www.sciencedirect.com/science/article/pii/B9780128051856000101

Metadata Management

Daniel Linstedt, Michael Olschimke, in Building a Scalable Data Warehouse with Data Vault 2.0, 2016

10.2.11 Capturing Access Control Lists and Other Security Measures

Most database systems allow the definition of users, roles, and access control lists (ACLs) on individual databases, tables, and views of the data warehouse. However, in order to effectively secure the data in the data warehouse, several actions have to be performed [12]:

Identify the data: this step is already complete when creating the suggested metadata described in the previous sections.

Classify data: for security purposes, the initial metadata used in step 1 and described in the previous sections should be modified to include a data sensitivity attribute.

Quantify the value of data: in order to estimate the costs of security breaches, the value of data assets needs to be quantified first. It might be hard to determine this value, especially because errors in the data warehouse might lead to erroneous business decisions.

Identify data security vulnerabilities: it is also hard to determine the potential vulnerabilities in a data warehouse, due to the long lifespan of the data storage and the unknown future use of the data.

Identify data protection measures and their costs: for each threat identified in the previous action, the potential remedies are identified and priced.

Select cost-effective security measures: the value of the data and the severity of the threat are used to level the identified security measures.

Evaluate the effectiveness of security measures: the effectiveness of the security measures needs to be assessed in the final step.

Without going into more detail regarding these activities, it becomes possible to identify the following metadata attributes that are required to define the security measures:

Data sensitivity: the classification of the data regarding security as defined in step 2. Table 10.14 lists and describes an example of a data classification matrix.

Table 10.14. Example Data Classifications [13]

Prohibited
Classification Guideline: Data is classified as prohibited if required by law or any other regulation or if the organization is required to self-report to the government or the individual if data has been inappropriately accessed.
Classification of Common Data Elements: Checking account numbers, credit card numbers, driver license numbers, health insurance policy numbers, social security numbers.
Access Protocol: Access only with permission and with signed non-disclosure agreements (NDA).
Transmission: Approved encryption is required in order to transmit the data through a network.
Storage: It is not allowed to store the data in any computing environment, including the data warehouse, unless especially approved by the organization. If it is explicitly allowed, approved encryption is required. It is not allowed for a third party to process or store the data without explicit approval.

Restricted
Classification Guideline: Data is classified as restricted if the data would qualify as prohibited but the organization determines that personnel could not effectively work with the data when not stored in the data warehouse.
Classification of Common Data Elements: Export controlled data, health information, passport numbers, visa numbers, personnel records.
Access Protocol: Access restricted only to those permitted under law, regulations or organizational policies and with a need to know for carrying out their assigned work and duties.
Transmission: Approved encryption is required in order to transmit the data through a network.
Storage: Approved encryption is required to store the data in the data warehouse. It is not allowed for a third party to process or store the data without explicit approval.

Confidential
Classification Guideline: Data is classified as confidential if it is not classified as prohibited or restricted. Exceptions to this rule are listed under Classification of Common Data Elements.
Classification of Common Data Elements: Contract data, emails, financial data, internal memos.
Access Protocol: Access restricted to users with a need to know and with the permission of the data owner or custodian.
Transmission: Approved encryption is recommended in order to transmit the data through a network.
Storage: Encryption is recommended but not required. The data owner might decide on required encryption. Data might be processed or stored by a third party if a valid contract with the party exists.

Unrestricted
Classification Guideline: Data is classified as unrestricted if not classified as prohibited, restricted, or confidential.
Classification of Common Data Elements: Data authorized to be available in public, data in the public domain.
Access Protocol: Access to unrestricted data is granted to everyone.
Transmission: No encryption is required to transmit unrestricted data through a network.
Storage: No encryption is required.

Data value: the value of the data as quantified in step 3.

Potential remedies per security vulnerability and data item: the potential remedies as identified in step 4.

Threat severity: the severity of the threat as identified in step 4.

Table 10.14 above provides an example of a data classification matrix that can be used to classify the raw data in the data warehouse.

While Table 10.14 provides a list of data classifications that can be used in data warehouse environments, it is still unclear how these levels can be applied to a Data Vault 2.0 model, among the other metadata attributes.

Classified data might be stored in all Raw Data Vault entities. Business keys might be classified as Table 10.14 shows: a credit card number or social security number by itself is already considered as prohibited and should only be stored in a data warehouse if approved by the organization. In that case, the hub that holds this business key should be elevated to the appropriate sensitivity level, restricted (because the organization allowed the storage of the data in the data warehouse). The same applies to satellites which hold descriptive data that might be classified as restricted or confidential. A best practice is to separate the data by sensitivity level. This can be done using satellite splits. For each resulting satellite, the database engine can enforce separate security controls.
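
As a hedged sketch of that satellite split, the two tables below separate a customer hub's descriptive attributes by sensitivity level so that the database engine can grant access to each satellite independently; all table, column, and role names are illustrative assumptions, not prescribed by Data Vault 2.0.

-- Confidential attributes live in their own satellite ...
CREATE TABLE sat_customer_confidential (
    hub_customer_hkey CHAR(32)  NOT NULL,
    load_date         TIMESTAMP NOT NULL,
    credit_limit      NUMERIC(12, 2),
    annual_income     NUMERIC(12, 2),
    PRIMARY KEY (hub_customer_hkey, load_date)
);

-- ... while unrestricted attributes are kept in a separate satellite.
CREATE TABLE sat_customer_unrestricted (
    hub_customer_hkey  CHAR(32)  NOT NULL,
    load_date          TIMESTAMP NOT NULL,
    preferred_language VARCHAR(10),
    PRIMARY KEY (hub_customer_hkey, load_date)
);

-- Separate security controls per satellite:
GRANT SELECT ON sat_customer_unrestricted TO role_all_analysts;
GRANT SELECT ON sat_customer_confidential TO role_cleared_analysts;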

However, the data is not only stored in Raw Data Vault entities. There are dependent entities in the Business Vault and in information marts that provide access to information which is based on the raw data provided by the Raw Data Vault, either in a virtual or materialized form. In both cases, the information should be classified to the same sensitivity level as the underlying raw data. When information is generated from multiple entities in the Raw Data Vault, each of them classified with different sensitivity levels, the most restrictive sensitivity level should be used for the derived information as a whole. For example, if a computed satellite in the Business Vault consolidates raw data from a satellite classified as unrestricted and another satellite classified as confidential, the computed satellite should be classified as confidential as well.

Such a classification of derived entities is not necessary if the grain of data is modified, for example during an aggregation. In this case, the classification of the computed satellite might be actually less restrictive because detailed information is not available anymore [14].

URL: https://www.sciencedirect.com/science/article/pii/B9780128025109000106

Databases

Alvaro A.A. Fernandes, Norman W. Paton, in Encyclopedia of Physical Science and Technology (Third Edition), 2003

IV.A Deductive Databases

Deductive database systems can be seen as bringing together mainstream data models with logic programming languages for querying and analyzing database data. Although there has been research on the use of deductive databases with object-oriented data models, this section illustrates deductive databases in the relational setting, in particular making use of Datalog as a straightforward deductive database language (Ceri et al., 1990).

A deductive (relational) database is a Datalog program. A Datalog program consists of an extensional database and an intensional database. An extensional database contains a set of facts of the form:

p(c1,…,cm)

where p is a predicate symbol and c1,…,cm are constants. Each predicate symbol with a given arity in the extensional database can be seen as analogous to a relational table. For example, the table organism in Fig. 2 can be represented in Datalog as:

organism('Bras. napus', 'rape', 'Brassicaceae',).
organism('Trif. repens', 'white clover', 'Papilionoideae',).

Traditionally, constant symbols start with a lower case letter, although quotes can be used to delimit other constants.

An intensional database contains a set of rules of the form:

p(x1,…,xm):−q1(x11,…,x1k),…,qj(xj1,…,xjp)

where p and qi are predicate symbols, and all argument positions are occupied by variables or constants.

Some additional terminology is required before examples can be given of Datalog queries and rules. A term is either a constant or a variable—variables are traditionally written with an initial capital letter. An atom p(t1,…,tm) consists of a predicate symbol and a list of arguments, each of which is a term. A literal is an atom or a negated atom ¬p(t1,…,tm). A Datalog query is a conjunction of literals.

Figure 16 shows some queries and rules expressed in Datalog. The queries DQ1 and DQ2 are expressed using a set comprehension notation, in which the values to the left of the ∣ are the result of the Datalog query to the right of the ∣.

FIGURE 16. Using Datalog to query the database state in Fig. 8.

Query DQ1 retrieves the organism identifier of ‘white clover’. Informally, the query is evaluated by unifying the query literal with the facts stored in the extensional database. Query DQ2 retrieves the identifiers of the proteins in ‘white clover’. However, the predicate species_protein representing the relationship between the common name of a species and the identifiers of its proteins is not part of the extensional database, but rather is a rule defined in the intensional database. This rule, DR1 in Fig. 16, is essentially equivalent to the SQL query RQ2 in Fig. 9, in that it retrieves pairs of species common names and protein identifiers that satisfy the requirement that the proteins are found in the species. In DR1, the requirement that a dna_Sequence be found in an organism is represented by the atoms for dna_Sequence and organism sharing the variable OID, i.e., the organism identifier of the organism must be the same as the organism identifier of the dna_Sequence.

Deductive databases can be seen as providing an alternative paradigm for querying and programming database applications. A significant amount of research has been conducted on the efficient evaluation of recursive queries in deductive languages (Ceri et al., 1990), an area in which deductive languages have tended to be more powerful than other declarative query languages. Recursive queries are particularly useful for searching through tree or graph structures, for example in a database representing a road network or a circuit design. A further feature of deductive languages is that they are intrinsically declarative, so programs are amenable to evaluation in different ways—responsibility for finding efficient ways of evaluating deductive programs rests as much with the query optimizer of the DBMS as with the programmer.

Although there have been several commercial deductive database systems, these have yet to have a significant commercial impact. However, some of the features of deductive languages, for example, relating to recursive query processing, are beginning to be adopted by relational vendors, and are part of the SQL-99 standard.
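
As an illustration of that adoption, the kind of recursive traversal that is natural in Datalog can now be expressed with the SQL-99 recursive query facility; the road-network table below is a hypothetical example rather than one drawn from this article.

-- All towns reachable from 'Manchester' by following road segments.
WITH RECURSIVE reachable (town) AS (
    SELECT 'Manchester'
  UNION
    SELECT r.to_town
    FROM road r
    JOIN reachable ON r.from_town = reachable.town
)
SELECT town FROM reachable;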

URL: https://www.sciencedirect.com/science/article/pii/B0122274105008449

RDF Dictionaries: String Encoding

In RDF Database Systems, 2015

4.1 Encoding motivation

RDF database systems frequently try to store a database in a compact way. A common and basic approach consists of assigning an integer value to each URI, blank node, or literal. The link between the original values and the encoded ones is usually stored in a data structure called a dictionary.

The underlying basic concept of a dictionary (which is not specific to the RDF model) is to provide a bijective function mapping long terms (e.g., URIs, blank nodes, or literals) to short identifiers (e.g., integers). More precisely, a dictionary should provide two basic operations: string-to-id and id-to-string (also referred to in the literature as the locate and extract operations). In a typical use, a SPARQL engine will call string-to-id to rewrite the user query so that it matches the data encoding, while id-to-string will be called to translate the result back into the original format. This latter operation will be called more often, because the number of terms to encode in a query is usually smaller than the number of terms to be translated in the result. As a consequence, different approaches will be used in the dictionary for these two operations.
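
A minimal relational sketch of this idea, assuming a hypothetical dictionary table and an encoded triple table (neither taken from the book), makes the two operations concrete.

-- Dictionary: a bijective mapping between terms and integer identifiers.
CREATE TABLE dictionary (
    id   BIGINT PRIMARY KEY,
    term TEXT NOT NULL UNIQUE   -- URI, blank node label, or literal
);

-- Triples are stored using identifiers only, giving a compact encoding.
CREATE TABLE triple (
    s BIGINT NOT NULL REFERENCES dictionary (id),
    p BIGINT NOT NULL REFERENCES dictionary (id),
    o BIGINT NOT NULL REFERENCES dictionary (id)
);

-- string-to-id (locate): called when rewriting a SPARQL query.
SELECT id FROM dictionary WHERE term = 'http://example.org/name';

-- id-to-string (extract): called when translating result identifiers back.
SELECT term FROM dictionary WHERE id = 42;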

With this approach, one ends up with a “new” data set composed of the dictionary and the encoded data set. This new data set provides a highly compressed view of the data (Neumann and Weikum, 2010). Nevertheless, the size of the dictionary is not negligible at all, and may even be larger than the encoded data set itself, as pointed out by Martínez-Prieto et al. (2012), and thus may induce scalability issues on its own. Overall, an RDF dictionary must be optimized with regard to two conflicting goals: minimizing its memory footprint and minimizing the answering time of the string-to-id and id-to-string operations.

While, due to data heterogeneity, a constant compression ratio cannot be achieved (Fernandez et al., 2010), some common features between data sets can be identified. Indeed, the predicates in a data set usually account for only a small fraction of the dictionary. Moreover, an overall compression improvement can be achieved by encoding any term that appears both as a subject and as an object only once. In the following, we present different approaches used in research and commercial solutions.

URL: https://www.sciencedirect.com/science/article/pii/B9780127999579000043