CS502 - Computing and Communication Technology
Fall-2012
Classes: MW 7:20 pm - 8:35 pm, Maria Sanford Hall 204
Instructor: Dr. Zdravko Markov, MS 307, (860)-832-2711, http://www.cs.ccsu.edu/~markov/,
e-mail: markovz at ccsu.edu
Office hours: MTWR 11:00 am - 12:15 pm, or by appointment
Prerequisites: Must be enrolled in graduate level
Description: The course offers a comprehensive coverage
of the basic concepts of Computing
and Data
Science. The first part of the course is devoted to the Computer
Organization and Design. This part discusses the main components of
computers and the basic principles of their operation. It demonstrates
the relationship between the software and hardware and focuses on the foundational
concepts that are the basis for current computer design. The second part
of the course discusses the fundametals of Data Science and some
related areas. The main emphasis here is on data distribution, as the way
data are generated, stored and used is naturally distributed. This part
discusses various levels of transmitting, storing and using information,
data and knowledge and surveys important areas as Information Theory, Switching,
Databases, Data classification, Distributed memory and data retrieval,
World Wide Web and Data/Web Mining.
Course objectives: Upon successful completion of the course the
student will be able to
-
Understand the basics of the Von Neumann computing architecture
-
Understand the basics of the MIPS instruction set architecture and write
simple assembly language programs
-
Design simple ALU and CPU at an abstract logical level
-
Understand the principles of Distributes Systems
-
Understand the basics of important related areas as Information Theory,
Switching, Databases, Information retrieval, World Wide Web and Data/Web
Mining
Required textbook: David A. Patterson and John L. Hennessy, Computer
Organization and Design: The Hardware/Software Interface, Fourth Edition,
Elsevier, 2008, ISBN: 978-0-12-374493-7.
Required software: SPIM
simulator: A free software simulator for running MIPS R2000 assembly
language programs available for Unix, DOS, and Windows.
WEB resources:
Class Participation: Regular attendance and active class participation
is expected from all students. If you must miss a class, try to inform
the instructor of this in advance. In case of missed classes and work due
to plausible reasons (such as illness or accidents) limitted assistance
will be offered. Unexcused absences will result in the student being totally
responsible for the make-up process.
Assignments, projects and grading: There will be a mid-term
test, a mid-term project and a term paper. There will
be also a programming assignment and a quiz. The final grade
will be based 40% on the term paper, 30% on the mid-term project, 20% on
the test, 5% on the programming assignment and 5% on the quiz, and will
be affected by classroom participation, conduct and attendance. All grades
will be availabe in Blackboard Vista. The letter grades will be
calculated according to the following table:
A |
A- |
B+ |
B |
B- |
C+ |
C |
C- |
D+ |
D |
D- |
F |
95-100 |
90-94 |
87-89 |
84-86 |
80-83 |
77-79 |
74-76 |
70-73 |
67-69 |
64-66 |
60-63 |
0-59 |
All assignments with their due dates are listed below in the class schedule
(the project descriptions are given on separate pages) and must be submitted
in Blackboard Vista available through CentralPipeline
(Blackboard Vista link) or directly at https://vista.csus.ct.edu/webct/logon/34918482122051.
Unexcused late submission policy: Assignments submitted more
than two days after the due date will be graded one letter grade
down. Projects submitted more than a week late will receive
two
letter grades down. No submissions will be accepted more
than two weeks after the due date.
Honesty policy: The CCSU honor code for Academic Integrity is
in effect in this class. It is expected that all students will conduct
themselves in an honest manner and NEVER claim work which is not their
own. Violating this policy will result in a substantial grade penalty,
and may lead to expulsion from the University. You may find it online at
http://web.ccsu.edu/academicintegrity/GradAcadMisconductPolicy.htm.
Please read it carefully.
Tentative schedule of classes, assignments, projects and tests (by week)
Note: Dates will be posted for all classes, project and test
due days. Check the schedule regularly for updates!
-
August 29, September 5, 10, 12: Computer
instruction set architecture
-
September 17, 19, 24: Computer
arithmetic and the ALU design
-
September 26: Programming
assignment due
-
September 26, October 1: CPU
datapath and control
-
October 3, 8, 10: Pipelining
-
October 10: Building
a 16-bit ALU (part of the Midterm
Project) due.
-
October 15, 17: Memory
hierarchies
-
October 22, 24: Interfacing
peripherals and multiprocessors
-
October 29: Midterm
Project due
-
November 1-7: Mid-term
test. To be taken online through Blackboard Vista
-
November 5, 7: No classes
-
November 12: Fundamentals
of distributed systems
-
November 14: Information
Theory
-
November 19: Switching
-
November 26: Database
management concepts
-
November 28: Distributed
memory and data retrieval
-
December 3: The
World Wide Web
-
December 5: Introduction
to Data Mining or term paper presentations
-
December 5-18: Quiz.
To be taken online through Blackboard Vista
-
December 17, 7-9: Last class (optional)
-
December 18: Term
paper due
CS502 - Week 1
Computer Architecture = Instruction Set Architecture + Machine Organization
Reading: Patterson & Hennessy - Chapter 1, Sections 2.1
- 2.3, 2.5 - 2.7, 2.10, 2.13 (optional), 2.16 - 2.20 (optional), Appendix
B, "Spim, pcspim, and xspim" in Section "Software" on the CD.
Lecture Slides: PDF
Lecture Notes:
-
Levels of Abstraction
-
Software:
-
Application
-
Operating System
-
Firmware
-
Instruction Set Architecture:
-
Organization of Programmable Storage
-
Data type and Structures: encodings and machine representation
-
Instruction set
-
Instruction Formats
-
Addressing Modes and Accessing Data and Instructions
-
Exception Handling
-
Hardware:
-
Instruction Set Processing
-
I/O System
-
Digital Design
-
Circuit Design
-
Layout
-
Basic Components of a Computer
-
Processor: Datapath and Control
-
Memory
-
I/O
-
Example: implementing A=B+C (from instructions to gates)
-
MIPS instructions
-
Arithmetic
-
Memory organization and load and store
-
Instruction types and formats
-
R-type
-
I-type for arithmetic
-
I-type for lw and sw
-
I-type for branch
-
Stored program concept: programs in memory, fetch&execute cycle
-
Von Neumann Architecture
-
CPU, Memory System, I/O system
-
Stored program concept: programs in memory, fetch&execute cycle
-
Instructions are executed sequentially
-
Turing machines
-
Non-Von Neumann Architecture:
-
The SPIM simulator
Exercises: Load this
program in the SPIM simulator and analyze the format of the insturctions.
Run the program with different values of X and Y and trace the execution
in step mode.
CS502 - Week 2
Computer arithmetic and ALU design
Reading: Patterson & Hennessy - Sections 2.4, 3.1
- 3.2, 3.3 - 3.4 (optional), 3.5 (floating-point representation only),
3.6 - 3.10 (optional), C.5 (CD).
Lecture Slides (PDF),
ALU diagram (PDF)
Lecture Notes:
-
Representing numbers: sign bit, one's complement, two's complement.
-
Arithmetic: addition, subtraction, detecting overflow.
-
Building ALU - hierarchical approach
-
Floating point numbers
-
Scientific notation: (-1)^sign * significand * 2^exponent
-
Range and precision (overflow and underflow).
-
IEEE 754 floating point standard - allows integer comparison:
-
normalized representation
-
implicit leading 1
-
exponent is biased: exponent in [0..0 (most negative), 1..1 (most positive)]
-
bits of the significand represent the fraction between 0 and 1.
-
(-1)^S * (1 + s1*2^-1 + s2*2^-2 + ...) * 2^(exponent-bias)
-
Problems with floating point arithmetic
Tutorials and practice quizzes:
-
Two’s complement numbers:
-
Floating point:
CS502 - Week 3
CPU datapath and control
Reading: Patterson & Hennessy - Sections 4.1 - 4.4
Lecture Slides: PDF
Lecture Notes:
I. Building a Datapath
-
Abstract level implementation:
-
Instruction memory
-
Program counter
-
Register file
-
ALU
-
Data memory
-
Basic building elements
-
Combinational logic (e.g. ALU)
-
State elements (registers, memory)
-
Clocking methodology: edge triggered
-
Fetching instructions and incrementing the program counter
-
Register file and execution of R-type instructions
-
Datapath for lw and sw instructions (add data memory and sign extend)
-
Datapath for branch instructions
Demo:
II. Control
-
ALU control: mapping the opcode and function bits to the ALU control inputs
-
Designing the main control unit
-
Operation of the Datapath (single-cycle implementation):
-
R-type instructions
-
Load (store) word
-
Branching instructions
-
Problems of the single-cycle implementation
CS502 - Week 4
Pipelining
Reading: Patterson & Hennessy - Sections 4.5 - 4.8 (without
implementation details).
Lecture Slides: PDF
Lecture notes:
I. Introduction to Pipelining
-
Pipelining by analogy (laundry example):
-
Pipelining helps throughput of the entire workload
-
Multiple tasks operating simultaneously and using different resources
-
Potential speedup = number of pipe stages
-
The pipeline rate is limited by the slowest stage
-
Unbalanced lengths of pipe stages reduces the speedup
-
The time to "fill" the pipeline and the time "drain" it reduces the speedup
-
Stall for dependencies
-
Five stages of the load MIPS instruction
-
Insturction fetch.
-
Instruction decode.
-
Execute (ALU operation).
-
Memory access.
-
Write back (register write).
-
The pipelined datapath
-
Single cycle, multiple cycle vs. pipeline
-
Advantages of pipelined execution
II. Problems with pipelining (pipeline hazards)
-
Structural hazards: single memory
-
Control hazards:
-
Stall: wait until decision is clear
-
Predict: fixed prediction (e.g. fail), dynamic prediction (based on history)
-
Delayed brach (software solution):
add $4, $5, $6
beq $1, $2, $40
beq $1, $2, 40 ==> add
$4, $5, $6
lw $3, 300($0)
lw $3, 300($0)
-
Data hazards (dependecies backwards in time):
-
Forwarding (bypassing - hardware solution)
-
Reordering code (software solution)
lw $t0, 0($t1)
lw $t0, 0($t1)
lw $t2, 4($t1) ==>
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
Demo:
CS502 - Week 5
Memory hierarchies
Reading: Patterson & Hennessy - Sections 5.1 - 5.5.
Lecture Slides: PDF
Lecture notes:
-
Memory technologies and trends
-
Impact on performance
-
The need of hierarchical memory organization
-
The principle of locality
-
Memory hierarchy terminology
-
The Basics of caches
-
Direct-mapped cache: Accessing a cache (Cash index,Tag,Valid bit)
-
Writing to the cache (write-through and write-back schemes)
-
Improving cache performance by flexible placement of blocks in the cache
(slides
in PS, slides
in PDF)
-
Direct mapped: Cache index = (Block address) modulo (Cache size); no search;
small tag.
-
Set associative: Cache index = (Block address) modulo (Number of sets in
cache); search the set; larger tag.
-
Fully associative: Cache index is not determined; search the whole cache;
tag = address.
-
Choosing which block to replace: least recently used
-
The need of virtual memory
-
Many programs (processes) can use a single memory
-
Use a memory exceeding the size of the main memory
-
VM organization and terminology: virtual address, physical address, page,
page offset, page fault, memory mapping (translation).
-
Addressing pages:
-
Page table, page table register
-
Processes (active, inactive) and page tables
-
Page faults
-
Replacing pages: LRU, reference (use) bit
-
Write-back scheme (dirty bit)
Exercises
-
Write a sequence of memory references for which:
-
the direct mapped cache performs better than the 2-way associative cache;
-
the 2-way associative cache performs better than the fully associative
cache.
-
Exercises 5.3, 5.4 (pages 550, 551).
CS502 - Week 6
Interfacing peripherals and multiprocessors
Reading: Patterson & Hennessy - Chapters 6.1 - 6.8, 7.1
- 7.4
Lecture Slides: Chapter7.pdf
Lecture notes:
-
Interfacing Processors and Peripherals - Buses (slides
in PDF)
-
Buses: lines, transactions, types
-
Synchronous and asynchronous buses
-
Handshaking protocol
-
Bus access: master and slave
-
Bus arbitration schemes
-
Bus standards
-
Interfacing I/O devices to Memory, CPU and OS (slides
in PDF)
-
The role of the operating system in interfacing I/O devices to Memory
-
Controlling the I/O devices
-
Memory mapped I/O
-
Special I/O instructions
-
Communicating with the processor
-
Polling
-
Interrupt-driven I/O
-
Direct memory access (DMA)
-
Multiprocessors (slides
in PDF)
-
Basic approaches to sharing data and types of connectivity
-
Programming multiprocessors
-
Multiprocessors connected by a single bus
-
A parallel program
-
Multiprocessor cache coherency
-
Implementing a multiprocessor cache coherency protocol
-
Synchronization using coherency
-
Networks of muiltiprocessors (slides
in PDF)
-
Shared memory vs. multiple private memories
-
Centralized memory vs. distributed memory
-
Parallel programming by message passing
-
Distributed memory communication
-
Memory allocation
-
Clusters and network topology
-
Modern clusters
CS502 - Week 7
Fundamentals of distributed systems
Additional Reading: Andrew S. Tanenbaum, Maarten van Steen,
Distributed Systems: Principles and Paradigms.
Lecture Notes (in PDF):
-
Categories of computer systems
-
Conventional sequential machines (mainframes, minicomputers + network
of terminals) - multitasking, multiusers.
-
Conventional systems with special purpose components (specialized processors)
- single specialized task.
-
Multiprocessor systems - single task allowing parallel computation.
-
Distributed systems (computers connected by a network) - different task,
shared data.
-
Conventional systems with special purpose components
-
A special purpose unit (e.g. math processor) attached to the main bus
-
Back-end system (additional separate machine, e.g. graphic terminal)
-
Example: iDBP
-
file operations: positioning and manipulating a cursor in a file
-
used to implement relational database systems
-
add-on board or back-end system
-
Multiprocessor systems:
-
Multiple processors
-
Shared memory (single address space) vs. multiple private memories
-
Centralized memory vs. distributed memory
-
Categories of parallelism:
-
Single instruction stream, single data stream (SISD)
-
Single instruction stream, multiple data streams (SIMD)
-
Multiple instruction streams, single data stream (MISD)
-
Multiple instruction streams, multiple data streams (MIMD)
-
Distributed computer systems
-
Issues:
-
data location and security
-
load distribution
-
process migration
-
fault tolerance
-
Types:
-
homogeneous systems
-
heterogeneous systems
-
Distributed file systems:
-
Fault-tolerant networks:
-
redundancy (static-masking, dynamic)
-
consistency (strong, weak)
-
Firewalls and self-stabilizing networks
Examples of distributed systems
CS502 - Week 8
Information Theory
-
Fundamentals of information theory (information vs. data)
-
Probability (see http://en.wikipedia.org/wiki/Probability
or http://mathworld.wolfram.com/Probability.html)
-
Frequency approach - experiments
-
Objective approach - properties real world entities
-
Subjective approach - the uncertainty of the agent (agent's knowledge)
-
Example: the probability that the sun will rise tomorrow - undefined, 1,
1-e, Laplas estimate: (d+1)/(d+2); Laplas for k-outcomes: (n+1)/(N+k)
-
Joint probability: P(A.B)=P(A)*P(B) (if A
and B are independent)
-
Conditional probability: P(A.B)=P(A|B)*P(B)
-
Bayes Theorem (more in PDF):
P(A|B)=P(B|A)*P(A)/P(B)
-
Interpretation of Bayes: A - cause; B - effect; posterior=likelihood*prior/evidence
-
Law of total probability: A={A1,A2,...An},
exhaustive, mutually exclusive (P(Ai.Aj)=0;i\==j); P(B)=SUM
P(B|Ai)P(Ai)
-
Claude Shannon (see http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html)
-
Measuring information
-
Settings: source, message, recipient
-
Recipient's point of view: a measure of uncertainty, expected surprise
of getting the message
-
Basic formula: I(M)=-P(M) or I(M)=-log2P(M)
(minimal number of bits to encode the message)
-
Logarithmic scale: additivity: I(M1)+I(M2)=-logP(M1)-logP(M2)=-log(P(M1)P(M2))=I(M1
and M2)
-
Log 2 scale: reducing the uncertainty (halving strategy, guess a number
in [0,7]), measuring unit: bit
-
Entropy (average information): uncertainty of the recipient, the expected
surprise of getting an arbitrary message from a set of possible messages.
-
H=-P1.logP1-P2.logP2-...-Pn.logPn
-
P1+P2+...+Pn=1
-
Example1: H({A,~A})=-P(A).logP(A)-P(~A).logP(~A). Maximum
for P(A)=1/2, minimum for P(A)=1
-
Example2: X={0,1,...,7}, P(X=1)=P(X=2)=...=p(X=7)=1/8;
H(X)=-8*(1/8)*log(1/8)=3 (bits)
-
Sampling
Theorem
-
Representing signals: frequency, time, amplitude
-
Baseband signals: F(t) limited in frequency
in the interval [-W,W]
-
Broadband signals (carrier based): moved in the interval [Wc-W,Wc+W],
e.g. F(t)=Fs(t)sin(Wct)
-
Representing signal by sample points (theoretically, infinite number of
points required)
-
locating a point in space: coordinate systems
-
locating a function F(t) : vector
(F(t1),F(t2),...,F(tn)),
tn - sample points
-
orthonormal sets of functions: Integral(a,b)
Gn(x)Gm(x)dx
= {1 for n=m; 0 for n\=m}
-
expansion of a function (Fourier series): F(x)=Sum(n=-inf,
n=+inf)CnGn(x),
Cm=Integral(-inf,+inf) F(x)Gm(x)dx
-
Sampling theorem (see also http://mathworld.wolfram.com/SamplingTheorem.html)
-
{Sincn(t)}
in [-T,T], T-> +inf; Sincn(t)=sin(2piW(t-n/(2W))/2piW(t-n/(2W)
-
F(t)=Sum(n=-inf, n=+inf) Cn
sin(2piW(t-n/(2W))/2piW(t-n/(2W)
-
F(tn)=Cn
(all terms above are 0 except one, sin(0)/0=1)
-
F(t)=Sum(n=-inf, n=+inf) F(tn)sin(2piW(t-tn)/2piW(t-tn),
tn=n/(2W) (sampling
theorem)
-
constraints: F(t) is limited in frequency
in [-W,W], infinite time duration (infinite
number of sample points)
-
AD and DA conversion: quantization (thresholds), amplifiers vs. repeaters,
generally very low error rates (regenerating digital values at relay points)
-
Error detection and correction
-
Binary case: {000,111} for {0,1} - one error correcting code
-
Arbitrary waveform: three instead of one sample point - widening the bandwidth
(2TW dimensions)
-
Demonstration
Applets from Information Technology: Inside and Outside by David Cyganski,
John A. Orr with Richard F. Vaz.
CS502 - Week 9
Switching
-
The importance of switching in communication - the cost of switching is
high
-
Definition: transfer input sample points to the correct output ports
at the correct time
-
Terminology
-
Switching
-
Digital switching (sample points amplitudes are 0's and 1's)
-
PABX
-
Circuit
-
Circuit switching
-
Packet switching
-
Voice digitization: W=3KHz, sampling at 2*3=6 or 8KHz, 256 levels for quantization
(8 bits), Bit rate=64Kb/s
-
Telephone switching
-
Time division multiplexing: time slot (0.1 ms), field, frame; 125ms/0.8=150
channels + time for synchronization and control
-
Switch architecture
-
Sampling input signals, storing values in memory, placing values in the
proper field and frame of the output sequence
-
Need for more channels: hierarchical switching
-
Combining time and space switching
-
General framework for switching
-
time, space and frequency (broadband signals) switching
-
synchronization (single clock) and buffering (memory)
-
set-up time and delay (propagation time)
-
"call duration" assignment vs. dynamic assignment
-
in-band and out-of-band signaling
-
Circuit (synchronous) vs. packet (asynchronous) switching: control and
routing overhead, virtual packet switching
-
Switching techniques and networking: switching is the technology allowing
to get a message between the nodes of a network
-
Crossbar switching: mechanical (in the past) or electronic.
-
Bus and cable switches: computer buses or cables (switching + transportation
= network)
-
Token passing approach (similar to the locks used by multiprocessors connected
by a bus)
-
Ethernet approach: cable or ring, packets, conflicts, resending
-
Synchronization and Hub switch: star networks (no conflicts)
Lecture
slides in PPT
CS502 - Week 10
Database management concepts
-
Database Management Systems (DBMS)
-
Managing large quantity of structured data
-
Efficient retrieval and modification: query processing and optimization
-
Sharing data: multiple users use and manipulate data
-
Controlling the access to data: maintaining the data integrity
-
An example of a database (relational): relations (tables), attributes (columns),
tuples (rows). Example query: Salesperson='Mary' AND Price>100.
-
Database schema (e.g. relational): names and types of attributes, addresses,
indexing, statistics, authorization rules to access data etc.
-
Data independence: separation of the physical and logical data (particularly
important for distributed systems). The mapping between them is provided
by the schema.
-
Architecture of a DBMS - three levels: external, conceptual and internal
schema
-
Types of DBMS
-
The data structures supported: tables (relational), trees, networks, objects
-
Type of service provided: high level query language, programming primitives
etc.
-
Basic DBMS types
-
Linear files
-
Sequence of records with a fixed format usually stored on a single file
-
Limitation: single file
-
Example query: Salesperson='Mary' AND Price>100
-
Hierarchical structure
-
Trees of records: one-to-many relationships
-
Limitations. ITEM-SUPPLIER: many-to-many relationship (requires duplicating
records); problems when updated
-
Retrieval requires knowing the structure (limited data independence): traversing
the tree from top to bottom using a procedural language
-
Network structure
-
Similar to the hierarchical database with the implementation of many-to-many
relationships
-
Record and set types
-
Relational structure
-
Relations, attributes, tuples
-
Primary key (unique combination of attributes for each tuple)
-
Foreign keys: relationships between tuples (many-to-many). Example: SUPPLIES
defines relations between ITEM and SUPPLIER tuples.
-
Advantages: many-to-many relationships, high level declarative query
language (e.g. SQL)
-
SQL example (retrieve all items supplied by a supplier located in Troy):
SELECT ItemName
FROM ITEM, SUPPLIES, SUPPLIER
WHERE SUPPLIER.City = "Troy" AND
SUPPLIER.Supplier# = SUPPLIES.Supplier#
AND
SUPPLIES.Item# = ITEM.Item#
-
Programming language interfaces: including SQL queries in the code
-
Object-Oriented structure
-
Objects (collection of data items and procedures) and interactions between
them.
-
Is this really a new paradigm, or a special case of network structure?
-
Separate implementation vs. implementation on top of a RDBMS
-
Retrieving and manipulating data: query processing
-
Parsing and validating a query: data dictionary - a relation listing
all relations and relations listing the attributes
-
Plans for computing the query: list of possible way to execute the
query, estimated cost for each. Example:
SELECT ItemNames, Price
FROM ITEM, SALES
WHERE SALES.Item# = ITEM.Item# AND Salesperson="Mary"
-
Index: B-tree index, drawbacks - additional space, updating; indexing
not all relations (e.g. the keys only)
-
Estimating the cost for computing a query: size of the relation, existence/size
of the idices.Example: estimating Attribute=value with a given
number of tuples and the size of the index.
-
Query optimization: finding the best plan (minimizing the computational
cost and the size of the intermediate results), subsets of tuples, projection
and join.
-
Static and dynamic optimization
-
Database views: creating user defined subsets of the database, improving
the user interface. Example:
CREATE VIEW MarySales(ItemName,Price)
AS SELECT ItemName, Price
FROM ITEM, SALES
WHERE ITEM.Item#=SALES.Item# AND Salesperson="Mary"
Then the query:
SELECT ItemName
FROM MarySales
WHERE Proce>100
translates to:
SELECT ItemName
FROM ITEM, SALES
WHERE ITEM.Item#=SALES.Item# AND Salesperson="Mary" AND Price>100
-
Data integrity
-
Integrity constraints: semantic conditions on the data
-
Individual constraints on data items
-
Uniqueness of the primary keys
-
Dependencies between relations
-
Concurrency control
-
Steps in executing a query.
-
Concurrent users of the database, interfering the execution of one query
by another
-
Transaction: a set of operations that takes the database from one
consistent state to another
-
Solving the concurrency control problem: making transactions atomic
operations (one at a time)
-
Concurrent transactions: serializability theory (two-phase locking), read
lock (many), write lock (one).
-
Seriazible transactions: first phase - accumulating locks, second phase
- releasing locks.
-
Deadlocks: deadlock detection algorithms.
-
Distributed execution problems: release a lock at one node (all locks accumulated
at the other node?); strict two-phase locking
-
Backup and recovery
-
The problem of keeping a transaction atomic: successful or failed (What
if some of the intermediate steps failed)
-
Log of database activity: use the log to undo a failed transaction.
-
More problems: when to write the log, failure of the recovery system executing
the log.
-
Security and access control
-
Access rules for relations or attributes. Stored in a special relation
(part of the data dictionary).
-
Content-independent and content-dependent access control
-
Content-dependent control: access to a view only or query modification(e.g.
and-ing a predicate to the WHERE clause)
-
Discretionary and mandatory access control
-
Client-Server architectures
-
Knowledge Bases and KBS (and area of AI)
-
Information, Data, Knowledge (data in a form that allows reasoning)
-
Basic components of a KBS
-
Knowledge base
-
Inference (reasoning) mechanism (e.g. forward/backward chaining)
-
Explanation mechanism/Interface
-
Rule-based systems (medical diagnostics, credit evaluation etc.)
Lecture
slides in PPT
CS502 - Week 11
Distributed memory and data/informition retrieval
-
The need of memory hierarchy
-
Massive data storage requirements (e.g. a satellite orbiting Earth generates
1 terabyte data per day)
-
Difference in the cost and capacity of different types of memory
-
Data location
-
Data location factors
-
Cost
-
Performance
-
Reliability
-
Application software
-
Directory - a mechanism to locate a data object
-
Directory location: may be different from the data location
-
Directory server: a separate system, not a part of the application program
-
Directory structure: usually hierarchical
-
Directory search: finding out the data location and making use of this
fact (e.g. designing an application)
-
Providing access control
-
Directory systems
-
Pure approaches
-
Master control directory: a single directory containing all the information
-
Directory server: a single computer providing all directory functions (contains
the master directory)
-
Fully replicated directories: copies of the directory at different locations
(concurrency control required)
-
Local directories: contain information on local data and usually stored
at the same location
-
Point of control directories: located at the point where the access control
is exercised (e.g. hard disk)
-
Problems with scaling: hierarchical system of directories (different
directories at different levels)
-
Important issues
-
Directory backup
-
Coherency and consistency
-
Local caching
-
Directories and access control (different functions). Points of access
control
-
Application program (problems with using data provided by other applications)
-
DBMS (the most common approach)
-
Control at the physical location of data (e.g. library)
-
Function of the communication networks (e.g. preventing a user from accessing
another node, telephone system)
-
Information retrieval (Slides
in PPT)
-
Basics
-
Finding relevant data using irrelevant keys
-
Example: database of photographic images sorted by number, date.
-
DBMS: Well structured data according to the information content
-
Text document retrieval (not well structured data)
-
Document retrieval queries
-
NLP: requires structured data to match the translated query
-
Abstracting documents
-
Keyword search: boolean, weighted
-
Evaluating retrieval quality
-
Precision: retrieved and relevant /retrieved
-
Recall: retrieved and relevant /relevant
-
Vector space model
-
Document and query vectors
-
Relevance ranking: cosine similarity
-
Access methods
-
Full text scanning: hardware support
-
Document signature: a fixed length bit string for each document (precision
and recall easily computed)
-
Inverted index: keyword or phrase indexing
-
Information Retrieval Resources and Demos: http://ir.exp.sis.pitt.edu/resources/
-
Web document retrieval (Slides
in PDF)
-
Crawling the Web
-
Analyzing the Web structure (Slides
in PDF)
-
Social Networks
-
Page Rank
-
Authorities and Hubs
-
Web document retrieval books and atricles
-
Data
Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage,
Chapters 1, 2
-
Mining the
Web: Discovering Knowledge from Hypertext Data, Chapters 7, 8.
-
The PageRank Citation
Ranking: Bringing Order to the Web by Lawrence Page, Sergey Brin, Rajeev
Motwani, and Terry Winograd. Stanford Digital Library Technology Project,
1998
-
Authoritative
Sources in a Hyperlinked Environment Jon M. Kleinberg. Journal of
the ACM 46:5 pp. 604-632, 1999
-
Document classification and clustering:
-
Document attributes: semantic (NLP, domain knowledge), statistical (keyword
frequencies, vector space model, TFIDF), mixed
-
Similarity measures: cosine similarity, distance, information compression
metrics
-
The Machine Learning approach
-
Bayesian text classification (reading: Tom
Mitchell, Machine Learning, McGraw Hill, 1997)
-
Clustering algorithms (reading: A
Comparison of Document Clustering Techniques)
-
Flat clustering: cluster and centroid (typical cluster document), k-means
algorithm
-
Hierarchical clustering (example Dendrogram)
-
Web Mining
-
Web content mining - discovery of Web document content patterns (text mining).
-
Web structure mining - discovery of hypertext/linking structure patterns
-
use hyperlinks to enhance text classification
-
page ranking
-
modeling and measuring the Web
-
Web usage mining - discovery of web users activity patterns
-
mining web server logs
-
mining client machine access logs
-
Web Mining books
CS502 - Week 12
The World Wide Web
-
A little history
-
1989, CERN: distribution of linked documents (nuclear physics)
-
1991, text-based prototype
-
1993, First graphical interface - Mosaic
-
1994: WWW consortium (CERN, MIT):
http://www.w3.org
-
The client side
-
WEB documents (pages) connected via hyperlinks (hypertext).
-
Hyperlinks: highlighted strings in text or images.
-
Browsers (text-based or graphical): tools for navigating the WEB.
-
Forms and Java applets
-
The server side
-
URL (Uniform Resource Locator): <protocol name>://<machine name>/<file
name>, e.g. http://www.w3.org/History.html
-
Steps of fetching http://www.w3.org/History.html
-
The browser determines the URL
-
The browser asks the local DNS (Domain Name System) server for the IP (Internet
Protocol) address
-
DNS replies with 18.23.0.23.
-
The browser makes a TCP (Transmission Control Protocol) connection to port
80 on 18.23.0.23
-
It then sends a GET /hypertext/WWW/History.html
-
The www.w3.org server send the file History.html
-
The TCP connection is released
-
The browser displays all the text in History.html
-
The browser displays all the images in History.html (new TCP connection
for each image)
-
Example of HTTP protocol in text:
C: telnet www.w3.org 80
T: Trying 18.23.0.23 ...
T: Connected to www.w3.org
T: Escape character is '^]'
C: GET /hypertext/WWW/TheProject.html HTTP/1.0
HTTP/1.0 200 Document follows
MIME-Version: 1.0
Server: CERN/3.0
Content-Type: text/html
Content-Length: 8247
<HEAD><TITLE> The World Wide Web Consortium
(W3C)</TITLE><HEAD>
<BODY>
<H1><IMG ALIGN=MIDDLE ALT="W3C" SRC="Icons/WWW/w3c_main.gif">
The World Wide Web Consortium </H><P>
The World Wide Web is the universe of network-accessible
information.
...
</BODY>
-
Using other protocols
-
HTTP Browser <---> FTP server
-
HTTP Browser <---> FTP Proxy server <---> FTP server
-
Proxy servers: translating protocols, caching pages, filtering information
-
HyperText Transfer Protocol (HTTP)
-
Simple (GET without the protocol version) and full requests
-
Methods (commands)
-
GET: request to read a Web page encoded in MIME (Multipurpose Internet
Mail Extensions - adding a header to describe the encoding)
-
HEAD: request to read a Web page header
-
PUT: request to store a Web page (may include authentication headers)
-
POST: request to append new data to a Web page (e.g. posting a message
to a news group)
-
DELETE: request to delete a Web page (may include authentication headers)
-
LINK: Connects two existing pages
-
UNLINK: breaks an existing connection between two pages
-
Writing WB page in HTML
-
URL (Universal Resource Locator): a mechanism for naming and locating Web
pages
-
<Protocol>://<DNS name of the host>/<File name (with possible
shortcuts, e.g. ~user)>
-
Protocols:
-
The HTML language
-
Standardized markup language: how the documents are to be formatted and
reformatted (e.g. LaTeX), contrasted to WYSIWYG (not standardized, does
not allow reformatting)
-
Tags with parameters, e.g. <IMG SRC ="http://www.widget.com/image.gif"
ALT="AWI Logo">
-
Hyperlinks:
-
<A HREF="http://www.nasa.gov"> NASA's home page </A>
-
<A HREF="http://www.nasa.gov"> <IMG SRC="shuttle.gif" ALT="NASA">
</A>
-
Forms (HTML 2.0): two-way trafic between the page owner and page user,
<INPUT> tag. Example: Look-Up
Classes
-
CGI (Common Gateway Interface): a standard for handling forms' data. Example:
interfacing a database:
-
Write a script (program) to interface between a database and the Web
-
Store the script in the cgi-bin directory under an URL.
-
When retrieves a page located in cgi-bin the HTTP server executes
the script.
-
Put the script name in the ACTION parameter of the form.
-
The browser invokes the operation specified in METHOD (usually POST)
-
The script is started and presented with the form input data.
-
After the database retrieval the script produce an HTML file, which is
sent back to the browser.
-
This mechanism can be used to include selected ads in the Web page depending
on what the user is looking for.
-
Java
-
Applets
-
Implementing highly interactive Web pages
-
Running applications and using remote servers or databases
-
Adding animation and sound to the Web page without spawning and external
viewer
-
No need of standard for the Web protocols (HTTP, FTP etc.)
-
The Java system for the Web
-
A Java-to-bytecode compiler
-
A browser that understands applets (<APPLET> tag)
-
A bytecode interpreter
-
Security issues
-
Problem: running a program on the client's machine (possible crash, collecting
private information, consuming resources)
-
Solutions:
-
First barrier: strong typed language - array bounds checking, no pointers
(thus no memory access)
-
Second barrier: bytecode verifier
-
Third barrier: class loader (loading first system classes before looking
for user classes)
-
Fourth barrier: security measures built in the classes, e.g. file access
class (asking user if the applet needs file access).
-
More problems:
-
asking to access the /tmp directory
-
requiring network access to work
-
covert channels
-
Locating information on the Web
-
The
Web Challenges (How to turn the web data into web knowledge)
-
Web Search Engines
-
Topic Directories
-
Semantic Web
-
Crawling the Web
-
Information Retrieval and Web Search
-
Web Agents
-
Is
There an Intelligent Agent in Your Future?
-
A good internet agent needs to be:
-
Communicative: able to understand your goals, preferences and constraints.
-
Capable: able to take options rather than simply provide advice.
-
Autonomous: able to act without the user being in control the whole time.
-
Adaptive: able to learn from experience about both its tasks and about
its users preferences.
-
Agent research sites:
-
Semantic Web
MIPS Programming assignment
(5 points)
Due date: September 26
Write a program in MIPS assembler to perform some simple computation (e.g.
average of three integers, converting miles into kilometers, Fahrenheit
into Celsius etc.). The program must include:
-
At least one instruction from each instruction type: R-type arithmetic,
I-type arithmetic and Memory transfer.
-
Input and output through system calls.
-
Comments explaining the format and the meaning of each instruction.
Use the SPIM simulator
to debug and run the program and read Patterson & Hennessy, B.9
and "Spim, pcspim, and xspim" in Section "Software" on the CD. You may
also find additional information about MIPS programming using SPIM in CS
254 - Computer organization and assembly language programming and Introduction
to RISC Assembly Language Programming (see example programs).
Documentation and submission: Submit the source text of
the program as a file attachment through Blackboard Vista > CS 502 > MIPS
Programming Assignment.
Mid-term test (20 points)
The Midterm test will be available from Blackboard Vista (the dates will
be posted around midterm). There are 20 multiple choice or short answer
questions that have to be answered within 3 hours.
The test includes the following topics:
-
Number systems (binary, hexadecimal, two's complements, floating
point)
-
MIPS instruction set architecture
-
Single cycle datapath and control
-
Basic principles of the pipelined datapath
-
Identifying dependencies, and solving pipeline hazards
-
Memory hierarchy (principles of caches)
Building a 16-bit ALU (10 points)
Not available at this time.
Mid-term project (20 ponits): Designing
a mini MIPS machine
Not available at this time.
Term paper (40 points)
Not available at this time.
Quiz (5 points)
The quiz will be available in Blackboard Vista. There
are 20 multiple choice questions that will have to be answered within 1
hour.
Review topics
-
Categories of computer systems:
-
Conventional sequential computers
-
Conventional systems with special purpose components
-
Multiprocessor systems
-
Distributed systems
-
Distributed systems:
-
Issues: data location and security, load distribution, process migration,
fault tolerance
-
Types: homogeneous systems, heterogeneous systems
-
Information Theory
-
Probability: Bayes's theorem
-
Measuring information: basic formula (logarithmic scale, halving strategy),
entropy (average information)
-
Signals:
-
Representation: time, amplitude, frequency
-
Transmission: sampling (Sampling Theorem), digitizing (AD and DA conversion),
bandwidth
-
Switching:
-
Circuit switching
-
Packet switching
-
Time division multiplexing
-
Ethernet approach: packets, conflicts and resending
-
DBMS:
-
Application area: large quantity of structured data
-
Types: tables (relational), trees, networks, objects
-
Data independence: database schema
-
Relational DBMS
-
Basics of SQL
-
Query processing: plans, optimization
-
Database views
-
Data integrity
-
Integrity constraints
-
Concurrency control
-
Backup and recovery
-
Security and access control
-
Information retrieval
-
Application area: finding relevant data using irrelevant keys (not well
structured data)
-
Text document retrieval: inverted index, retrieval queries, retrieval quality
(precision, recall), relevance ranking
-
Analyzing the Web structure: page rank