A surrogate key is typically a numeric value. In SQL Server, a column can be defined with the identity property to help generate surrogate key values. The PRIMARY KEY constraint uniquely identifies each record in a database table: primary key columns must contain unique values and cannot contain NULL values. The database system may manage surrogate key generation; most often the key is of a numeric type (e.g. integer or bigint) and is incremented whenever a new key is needed. If we want to control surrogate key generation ourselves, we can employ a 128-bit GUID or UUID. This simplifies batching and may improve insert performance.
A surrogate key (or synthetic key, entity identifier, system-generated key, database sequence number, factless key, technical key, or arbitrary unique identifier) in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.[1]
There are at least two definitions of a surrogate:
The Surrogate (1) definition relates to a data model rather than a storage model and is used throughout this article. See Date (1998).
An important distinction between a surrogate and a primary key depends on whether the database is a current database or a temporal database. Since a current database stores only currently valid data, there is a one-to-one correspondence between a surrogate in the modeled world and the primary key of the database. In this case the surrogate may be used as a primary key, resulting in the term surrogate key. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.
Although Hall et al. (1976) say nothing about this, others have argued that a surrogate should have the following characteristics:
In a current database, the surrogate key can be the primary key, generated by the database management system and not derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated UUID (for example, an HR number for each employee other than the UUID of each employee).
A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server 'identity column', a PostgreSQL or Informix serial, an Oracle or SQL Server SEQUENCE, or a column defined with AUTO_INCREMENT in MySQL). Some databases provide UUID/GUID as a possible data type for surrogate keys (e.g. PostgreSQL UUID or SQL Server UNIQUEIDENTIFIER).
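As a rough illustration of application-generated UUID keys, the following sketch uses Python's standard `uuid` and `sqlite3` modules. The `employee` table and its columns are hypothetical, and SQLite is only a stand-in: it has no native UUID type, so the key is stored as text here, whereas PostgreSQL or SQL Server would use their dedicated types.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical table; SQLite has no UUID type, so the key is stored as 36-character text.
conn.execute("CREATE TABLE employee (employee_id TEXT PRIMARY KEY, name TEXT NOT NULL)")

# The application generates the keys itself, so a whole batch can be
# prepared before any statement is sent to the database.
rows = [(str(uuid.uuid4()), name) for name in ("John Smith", "Bob Brown")]
conn.executemany("INSERT INTO employee (employee_id, name) VALUES (?, ?)", rows)

stored = conn.execute("SELECT employee_id, name FROM employee ORDER BY name").fetchall()
```

Because the key is known before the insert, related rows can reference it without first asking the database which value it generated.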
Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile) and guarantees uniqueness.
In a temporal database, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key and a surrogate key. The surrogate key identifies one unique row in the database; the business key identifies one unique entity of the modeled world. One table row represents a slice of time holding all the entity's attributes for a defined timespan. Those slices depict the whole lifespan of one business entity. For example, a table EmployeeContracts may hold temporal information to keep track of contracted working hours. The business key for one contract will be identical (non-unique) in both rows; however, the surrogate key for each row is unique.
SurrogateKey | BusinessKey | EmployeeName | WorkingHoursPerWeek | RowValidFrom | RowValidTo |
---|---|---|---|---|---|
1 | BOS0120 | John Smith | 40 | 2000-01-01 | 2000-12-31 |
56 | P0000123 | Bob Brown | 25 | 1999-01-01 | 2011-12-31 |
234 | BOS0120 | John Smith | 35 | 2001-01-01 | 2009-12-31 |
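The table above can be sketched with Python's built-in `sqlite3` module to show how the two keys are used. The column types and the helper `hours_on` are assumptions made for illustration; dates are stored as ISO-8601 text so that they compare correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EmployeeContracts (
        SurrogateKey        INTEGER PRIMARY KEY,  -- unique per row (one time slice)
        BusinessKey         TEXT NOT NULL,        -- identifies the entity, repeats across slices
        EmployeeName        TEXT NOT NULL,
        WorkingHoursPerWeek INTEGER NOT NULL,
        RowValidFrom        TEXT NOT NULL,
        RowValidTo          TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO EmployeeContracts VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "BOS0120", "John Smith", 40, "2000-01-01", "2000-12-31"),
        (56, "P0000123", "Bob Brown", 25, "1999-01-01", "2011-12-31"),
        (234, "BOS0120", "John Smith", 35, "2001-01-01", "2009-12-31"),
    ],
)

def hours_on(business_key, day):
    """Return the weekly hours valid for one business entity on an ISO date, or None."""
    row = conn.execute(
        """SELECT WorkingHoursPerWeek FROM EmployeeContracts
           WHERE BusinessKey = ? AND ? BETWEEN RowValidFrom AND RowValidTo""",
        (business_key, day),
    ).fetchone()
    return row[0] if row else None
```

Looking up `BOS0120` on different dates selects different rows (different surrogate keys) for the same business entity.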
Some database designers use surrogate keys systematically regardless of the suitability of other candidate keys, while others will use a key already present in the data, if there is one.
Some of the alternate names ('system-generated key') describe the way of generating new surrogate values rather than the nature of the surrogate concept.
Approaches to generating surrogates include:
- Sybase or SQL Server identity column: IDENTITY or IDENTITY(n,n)
- Oracle: SEQUENCE, or GENERATED AS IDENTITY (starting from version 12.1)[3]
- SQL Server: SEQUENCE (starting from SQL Server 2012)[4]
- MySQL: AUTO_INCREMENT
- SQLite: AUTOINCREMENT
- IBM DB2: AS IDENTITY GENERATED BY DEFAULT

Surrogate keys do not change while the row exists. This has the following advantages:
Attributes that uniquely identify an entity might change, which might invalidate the suitability of natural keys. Consider the following example:
In these cases, generally a new attribute must be added to the natural key (for example, an original_company column). With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change.
Some problem domains do not clearly identify a suitable natural key. Surrogate keys avoid choosing a natural key that might be incorrect.
Surrogate keys tend to be a compact data type, such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns. Furthermore, a non-redundant distribution of keys causes the resulting b-tree index to be completely balanced. Surrogate keys are also less expensive to join (fewer columns to compare) than compound keys.
With several database application development systems, drivers, and object-relational mapping systems, such as Ruby on Rails or Hibernate, it is much easier to use integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.
When every table has a uniform surrogate key, some tasks can be easily automated by writing the code in a table-independent way.
It is possible to design key values that follow a well-known pattern or structure which can be automatically verified. For instance, the keys that are intended to be used in some column of some table might be designed to 'look different from' those that are intended to be used in another column or table, thereby simplifying the detection of application errors in which the keys have been misplaced. However, this characteristic of the surrogate keys should never be used to drive any of the logic of the applications themselves, as this would violate the principles of database normalization.
The values of generated surrogate keys have no relationship to the real-world meaning of the data held in a row. When inspecting a row holding a foreign key reference to another table using a surrogate key, the meaning of the surrogate key's row cannot be discerned from the key itself. Every foreign key must be joined to see the related data item. If appropriate database constraints have not been set, or if data has been imported from a legacy system where referential integrity was not enforced, it is possible to have a foreign-key value that does not correspond to a primary-key value and is therefore invalid. (In this regard, C.J. Date regards the meaninglessness of surrogate keys as an advantage.[5])
To discover such errors, one must perform a query that uses a left outer join between the table with the foreign key and the table with the primary key, showing both key fields in addition to any fields required to distinguish the record; all invalid foreign-key values will have the primary-key column as NULL. The need to perform such a check is so common that Microsoft Access actually provides a 'Find Unmatched Query' wizard that generates the appropriate SQL after walking the user through a dialog. (It is, however, not too difficult to compose such queries manually.) 'Find Unmatched' queries are typically employed as part of a data cleansing process when inheriting legacy data.
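A hand-written "find unmatched" query of the kind described above might look like the following sketch, run here through Python's `sqlite3` module. The `customer` and `orders` tables and their contents are hypothetical; no foreign-key constraint is declared, to mimic imported legacy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- No FOREIGN KEY constraint is declared, mimicking imported legacy data.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customer VALUES (1, 'John Smith'), (2, 'Bob Brown');
    INSERT INTO orders VALUES (10, 1, 'desk'), (11, 2, 'chair'), (12, 99, 'lamp');
""")

# Left outer join from the referencing table to the referenced table;
# rows whose primary-key side is NULL hold an invalid foreign-key value.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id AS fk_value, c.customer_id AS pk_value
    FROM orders AS o
    LEFT OUTER JOIN customer AS c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()
```

Here `orphans` contains only order 12, whose `customer_id` of 99 matches no customer row.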
Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have different keys. This can be mitigated by NOT exporting surrogate keys, except as transient data (most obviously, in executing applications that have a 'live' connection to the database).
When surrogate keys supplant natural keys, then domain specific referential integrity will be compromised. For example, in a customer master table, the same customer may have multiple records under separate customer IDs, even though the natural key (a combination of customer name, date of birth, and E-mail address) would be unique. To prevent compromise, the natural key of the table must NOT be supplanted: it must be preserved as a unique constraint, which is implemented as a unique index on the combination of natural-key fields.
Relational databases assume a unique index is applied to a table's primary key. The unique index serves two purposes: (i) to enforce entity integrity, since primary key data must be unique across rows and (ii) to quickly search for rows when queried. Since surrogate keys replace a table's identifying attributes (the natural key), and since the identifying attributes are likely to be those queried, the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a candidate key, the index can be a unique index.
These additional indexes, however, will take up disk space and slow down inserts and deletes.
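The effect of such an index can be observed in a small sketch using SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module. The `customer` table, the `idx_customer_email` index, and the helper `plan` are hypothetical, and the exact wording of the plan text varies between SQLite versions, so it is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT NOT NULL, name TEXT)"
)

def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT name FROM customer WHERE email = 'john@example.com'"

plan_before = plan(query)  # only the surrogate key is indexed: the table is scanned
conn.execute("CREATE UNIQUE INDEX idx_customer_email ON customer (email)")
plan_after = plan(query)   # the optimizer can now search the index on email
```

Before the index exists the plan reports a scan of `customer`; afterwards it names `idx_customer_email`.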
Surrogate keys can result in duplicate values in any natural keys. To prevent duplication, one must preserve the role of the natural keys as unique constraints when defining the table, using either SQL's CREATE TABLE statement or, if the constraints are added as an afterthought, an ALTER TABLE ... ADD CONSTRAINT statement.
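A minimal sketch of keeping the natural key as a unique constraint next to a surrogate primary key, using Python's `sqlite3` module. The `customer` table and its column names are assumptions based on the customer example earlier in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,      -- surrogate key
        name          TEXT NOT NULL,
        date_of_birth TEXT NOT NULL,
        email         TEXT NOT NULL,
        UNIQUE (name, date_of_birth, email)     -- natural key kept as a constraint
    )
""")
insert = "INSERT INTO customer (name, date_of_birth, email) VALUES (?, ?, ?)"
conn.execute(insert, ("John Smith", "1980-05-17", "john@example.com"))

try:
    # Same real-world customer again: a fresh surrogate value would be generated,
    # but the unique constraint on the natural key rejects the row.
    conn.execute(insert, ("John Smith", "1980-05-17", "john@example.com"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Without the `UNIQUE` clause the second insert would succeed and the same customer would exist under two surrogate keys.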
Because surrogate keys are unnatural, flaws can appear when modeling the business requirements. Business requirements, relying on the natural key, then need to be translated to the surrogate key. A strategy is to draw a clear distinction between the logical model (in which surrogate keys do not appear) and the physical implementation of that model, to ensure that the logical model is correct and reasonably well normalised, and to ensure that the physical model is a correct implementation of the logical model.
Proprietary information can be leaked if sequential key generators are used. By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period. There are a few ways to overcome this problem:
Sequentially generated surrogate keys can imply that events with a higher key value occurred after events with a lower value. This is not necessarily true, because such values do not guarantee time sequence as it is possible for inserts to fail and leave gaps which may be filled at a later time. If chronology is important then date and time must be separately recorded.
Posted Feb 28, 2011
By Gregory A. Larsen
In my last article I talked about the difference between surrogate keys and natural keys. In that article I discussed how surrogate keys are made up keys, meaning they do not appear naturally in the data. In this article I will be showing you how to generate those surrogate keys using an identity column. I will be exploring what is an identity column, how to define an identity column and the different methods of populating an identity column.
An identity column is a single column in a table that has its identity column property set. A table doesn't need to have an identity column. When a table has an identity column, that column is automatically populated with an integer value every time a new row is added to the table; more on this in a minute. The value of an identity column is based on a seed and increment value associated with the identity column; more detail on this further down in this article.
An identity column property can only be set on columns that are declared as decimal, int, numeric, smallint, bigint, or tinyint. If the identity property is associated with a numeric or decimal column, the scale must be set to 0. When you set the identity property, there are two components of that property: seed and increment. Additionally, the column must be defined to not allow NULL values to be inserted into it. You can set up an identity column when you declare a table, or you can set up an identity column on an existing table column by altering the column properties.
When you create a table you can define the identity column. You can also add an identity column to a pre-existing table; more on that later. To define an identity column when you create a table you just need to set the IDENTITY property on the CREATE TABLE statement. Here is an example:
Above I created a column called 'ID' that is my IDENTITY column. Note that I specified IDENTITY(1,1). The '1,1' notation specifies the 'seed' and 'increment' values for the identity column. The 'seed' value sets the value of the ID column for the first row inserted into the table. The 'increment' value is added to the identity column value of the previously inserted row to produce the identity column value of each additional row. The 'seed' and 'increment' values must be integers; both positive and negative values are allowed. In my example above I said I wanted my first row inserted to have an identity column value of 1. The second inserted row would have an identity column value of 2, and so on.
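The seed and increment arithmetic can be written out as a small Python sketch. The function `identity_value` is hypothetical and only models the rule described above; it assumes that no failed inserts, deletes, or explicitly supplied values have disturbed the series.

```python
def identity_value(seed, increment, row_number):
    """Identity value of the n-th inserted row (1-based), assuming an undisturbed series."""
    if row_number < 1:
        raise ValueError("row_number starts at 1")
    # The first row gets the seed; every later row adds the increment once more.
    return seed + (row_number - 1) * increment

first_three = [identity_value(1, 1, n) for n in (1, 2, 3)]      # models IDENTITY(1,1)
counting_down = [identity_value(-1, -1, n) for n in (1, 2, 3)]  # negative seed and increment
```

With a seed of 1 and an increment of 1 the first three rows receive 1, 2, and 3; with a seed of -1 and an increment of -1 they receive -1, -2, and -3.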
You can also create an identity column when creating a table using a SELECT statement with an INTO table clause. To do this you use the IDENTITY function. The IDENTITY function has the following syntax:
IDENTITY(data_type [, seed, increment]) AS column_name
Where data_type is one of the valid identity column data types listed above, seed is the identity column value for the first row added, increment is an integer value that is added to the identity column value of the prior inserted row, and column_name is the name of the IDENTITY column to be created.
Here is an example of how to create a new table that has an identity column using a SELECT .. INTO method:
Here I am using the SELECT..INTO syntax to create the table MyTableNew. To define my identity column I used the IDENTITY function to define an integer column where the identity properties have a seed value of 1 and an increment value of 1.
Occasionally you might find you need to add the identity property to an existing column in an existing table, or add a new identity column to an existing table. Let me explore how to do this, and the issues you might run into.
First let's talk about altering a table to add an identity column to an existing table. By adding an identity column, I mean adding a brand new column to a table. To do that you need to alter the table definitions. Let's assume I have the following table definition:
For this example, assume that this table already has 39 rows, where the County Code contains an abbreviation of the county name that uniquely identifies each row, the ReferenceID is basically a row number that is manually populated, and the CountyName is the spelled-out name of the county. Say I decided I wanted to put a surrogate key column on this table that is an INT and populate it using the IDENTITY property. To do that I would just need to run the following ALTER TABLE statement:
Upon executing this ALTER statement, SQL Server will first alter the table adding the CountyID column. Then once the column is added SQL Server will number all the existing rows automatically based on the identity property.
Assume I want to set the identity property of my existing ReferenceID which has already been populated manually with a row number. There is no simple one statement method to accomplish this. Instead I have to jump through a number of hoops to do this.
Assume my original table above looked like this:
Here I have 39 existing records populated in this table, where each row has a unique reference number that has been set manually. Assume for now there are no constraints on this table. In order to make the ReferenceID my identity column, I would first need to rename the table to something like dbo.CountyOld. Then I could create my new County table using the following code, which sets the ReferenceID as an identity column:
After this I would set IDENTITY_INSERT ON (more on this in the next section) for this table. Then I would run the following code:
After the INSERT statement was done running, I would turn the IDENTITY_INSERT OFF for this table, and then drop the dbo.CountyOld table. If I had constraints on my table I would have to take the necessary actions to drop and recreate those constraints.
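The rename, recreate, copy, and drop sequence can be tried out end to end with Python's `sqlite3` module. This is an analogue rather than the T-SQL itself: SQLite's `INTEGER PRIMARY KEY AUTOINCREMENT` stands in for the IDENTITY property, SQLite accepts explicit key values without any IDENTITY_INSERT setting, and there is no `dbo` schema. The column name `CountyCode`, the column types, and the three sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Starting point: ReferenceID was numbered by hand (types and rows are assumed).
    CREATE TABLE County (CountyCode TEXT, ReferenceID INTEGER, CountyName TEXT);
    INSERT INTO County VALUES ('KI', 1, 'King'), ('PI', 2, 'Pierce'), ('SN', 3, 'Snohomish');

    -- 1. Rename the original table out of the way.
    ALTER TABLE County RENAME TO CountyOld;
    -- 2. Recreate it with ReferenceID as the automatically generated key.
    CREATE TABLE County (
        CountyCode  TEXT,
        ReferenceID INTEGER PRIMARY KEY AUTOINCREMENT,
        CountyName  TEXT
    );
    -- 3. Copy the rows across, keeping the existing ReferenceID values.
    INSERT INTO County (CountyCode, ReferenceID, CountyName)
        SELECT CountyCode, ReferenceID, CountyName FROM CountyOld;
    -- 4. Drop the old copy.
    DROP TABLE CountyOld;
""")

# A new row that omits ReferenceID continues after the highest copied value.
conn.execute("INSERT INTO County (CountyCode, CountyName) VALUES ('TH', 'Thurston')")
rows = conn.execute("SELECT ReferenceID, CountyCode FROM County ORDER BY ReferenceID").fetchall()
```

The copied rows keep their manually assigned numbers, and the row added afterwards receives the next value, 4.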
Alternatively, I can use the 'Design' feature of a table in SQL Server Management Studio to set the identity properties on an existing table. Using SQL Server Management Studio, perform similar steps as I described above.
When you have a table with an identity column there are things you need to think about when inserting records into these tables. Let me go through a couple of INSERT statements to describe how inserting records is done.
First, let me talk about how to insert records where the identity column is populated automatically using the identity properties. Remember the table dbo.MyTable that I created above. It had three columns (ID, MyShortDesc, and MyLongDesc), and the identity property was set on the ID column. This is the table I will be using for my example, and here is an INSERT statement that adds a new row to this table:
In this example I specified the column names I was populating with values in dbo.MyTable by placing those columns inside parentheses immediately following dbo.MyTable. Note how I didn't specify the identity column ID. I didn't have to include this column in my INSERT statement because it will automatically be populated using the identity property setting associated with this column. Another way to write this insert statement is like this:
Here I left off the column names following the table name dbo.MyTable. I was able to do that because SQL Server knows the only other column on this table is the identity column, and it knows how to populate the value for that column.
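The idea of leaving the key column out of the INSERT can be reproduced with Python's `sqlite3` module, again with SQLite's `AUTOINCREMENT` standing in for IDENTITY(1,1). The column types and the inserted values are assumptions. One difference from SQL Server is worth noting: SQLite needs the column list when the key column is omitted, so only the first form of the INSERT is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        ID          INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for IDENTITY(1,1)
        MyShortDesc TEXT,
        MyLongDesc  TEXT
    )
""")

insert = "INSERT INTO MyTable (MyShortDesc, MyLongDesc) VALUES (?, ?)"

# ID is not named in the column list; the database supplies its value.
first_id = conn.execute(insert, ("short", "a longer description")).lastrowid
second_id = conn.execute(insert, ("short 2", "another description")).lastrowid
```

The two rows receive the keys 1 and 2 without the application ever mentioning the ID column.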
What if I wanted to set the identity column value myself in the INSERT statement? How is this done? As it turns out, this isn't as simple as one might think. I CANNOT just execute this code:
If I try to run an INSERT statement like this, where I try to specify a value for the identity column, I would get this error:
This error message tells me I need to set the IDENTITY_INSERT option to ON if I want to explicitly set the identity value. Let's try this again and set IDENTITY_INSERT to ON by using this code:
By using the SET statement to set the IDENTITY_INSERT option to ON, I am allowed to set the identity column ID to the value 12. Keep in mind that you can have IDENTITY_INSERT turned on for only one table at a time in a session. Also, when you have IDENTITY_INSERT on you are able to insert multiple rows with the same identity column value, provided you don't have a constraint that restricts duplicate values in your identity column. You can also insert rows that have an identity column value greater than the last identity column value created. This will leave holes in your identity column values and will also advance the value SQL Server keeps to determine the next identity value. Once you are done inserting rows where you set the identity column value yourself, you should turn off the IDENTITY_INSERT option by running the following command:
You might be wondering what happens with identity column values when you delete a record in a table that has an identity column. When rows are deleted, the identity values are not reused. Therefore, over time you will have gaps in your identity column values based on the records that have been deleted. If this is a problem for your situation, you might consider using a trigger to populate a sequential number column instead of using an identity column.
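Both sources of gaps discussed above (deleted rows and explicitly supplied key values) can be observed in a short sketch with Python's `sqlite3` module. SQLite's `AUTOINCREMENT` is used as a stand-in for an identity column; unlike SQL Server it accepts an explicit key value without any IDENTITY_INSERT setting, and because `ID` is the primary key here, duplicate values are rejected. The helper `add` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY AUTOINCREMENT, MyShortDesc TEXT)")

def add(desc):
    """Insert a row without naming ID and return the generated value."""
    return conn.execute("INSERT INTO MyTable (MyShortDesc) VALUES (?)", (desc,)).lastrowid

ids = [add("a"), add("b"), add("c")]   # 1, 2, 3
conn.execute("DELETE FROM MyTable WHERE ID = 3")
after_delete = add("d")                # 4: the deleted value 3 is not reused

# Explicitly supplying a larger key moves the generator forward and leaves a hole.
conn.execute("INSERT INTO MyTable (ID, MyShortDesc) VALUES (12, 'explicit')")
after_explicit = add("e")              # 13

remaining = [row[0] for row in conn.execute("SELECT ID FROM MyTable ORDER BY ID")]
```

The table ends up holding the keys 1, 2, 4, 12, and 13, so the key values alone say nothing reliable about how many rows exist.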
Identity columns make it easy to have surrogate key columns that are automatically populated. Having a column be populated by the identity property also makes it easy to create unique identity column values for each row. Next time you want a surrogate key when you design a table, consider creating the key as an identity column.