The last class is on Thursday, May 8 at 6:45 (?)
The Quiz is available in Vista from May 7
The Term Paper is due on May 15
CS502 - Computing and Communication Technology
Spring-2008
Classes: TR 6:45 pm - 8:00 pm, Maria Sanford Hall 210
Instructor: Dr. Zdravko Markov, MS 307, (860)-832-2711, http://www.cs.ccsu.edu/~markov/,
e-mail: markovz at ccsu.edu
Office hours: TR 9:30am - 12:00pm, or by appointment
Prerequisites: Must be enrolled in graduate level
Description: The course offers a comprehensive coverage
of the basic concepts of computing and data technologies. The first
part of the course is devoted to the computer organization and design.
This part discusses the main components of computers and the basic principles
of their operation. It demonstrates the relationship between the software
and hardware and focuses on the foundational concepts that are the basis
for current computer design. The second part of the course discusses the
architecture of the distributed computer systems. The main emphasis here
is on data distribution, as the way data are generated, stored and used
is naturally distributed. This part discusses various levels of transmitting,
storing and using information, data and knowledge and surveys important
areas as Information Theory, Switching, Databases, Data classification,
Distributed memory and data retrieval, World Wide Web and Data Mining.
Course objectives: Upon successful completion of the course the
student will be able to
-
Understand the basics of the Von Neumann computing architecture
-
Understand the basics of the MIPS instruction set architecture and write
simple assembly language programs
-
Design simple ALU and CPU at an abstract logical level
-
Understand the principles of distributes systems and
-
Understand the basics of important related areas as Information Theory,
Switching, Databases, Information retrieval, World Wide Web and Data Mining
Required textbook: David A. Patterson and John L. Hennessy, Computer
Organization and Design: The Hardware/Software Interface, Third Edition,
Morgan Kaufmann Publishers, 2004, ISBN: 1-55860-604-1.
Required software: SPIM
simulator: A free software simulator for running MIPS R2000 assembly
language programs available for Unix, DOS, and Windows.
WEB resources:
Honesty policy: It is expected that all students will conduct themselves
in an honest manner (see the CCSU Student handbook), and NEVER claim work
which is not their own. Violating this policy will result in a substantial
grade penalty, and may lead to expulsion from the University.
Assignments, projects and grading: There will be a mid-term
test, a mid-term project and a term paper. There will
be also a programming assignment and a quiz. The final grade
will be based 40% on the term paper, 30% on the mid-term project, 20% on
the test, 5% on the programming assignment and 5% on the quiz. The letter
grades will be calculated according to the following table:
| A |
A- |
B+ |
B |
B- |
C+ |
C |
C- |
D+ |
D |
D- |
F |
| 95-100 |
90-94 |
87-89 |
84-86 |
80-83 |
77-79 |
74-76 |
70-73 |
67-69 |
64-66 |
60-63 |
0-59 |
Tentative schedule of classes (weeks) and assignments
-
Computer
instruction set architecture
-
Computer
arithmetic and the ALU design
-
February 19: Programming
assignment due. Week topic: CPU
datapath and control
-
March 4: Building
a 16-bit ALU (part of the Midterm
Project) due. Week topic: Pipelining
-
Memory
hierarchies
-
Interfacing
peripherals and multiprocessors
-
March 27: Midterm
Project due
-
Mid-term
test. To be taken online through Blackboard Vista between April 3 and
April 10.
-
Fundamentals
of distributed systems
-
Information
Theory
-
Switching
-
Database
management concepts
-
Distributed
memory and data retrieval
-
Quiz.due.
Week topic: The
World Wide Web
-
Introduction
to Data Mining
-
Term
paper due.
CS502 - Week 1
Computer Architecture = Instruction Set Architecture + Machine Organization
Reading: Patterson & Hennessy - Chapter 1, Sections 2.1
- 2.6, 2.8, 2.9, 2.16
Lecture Slides: PDF
Lecture Notes:
-
Levels of Abstraction
-
Software:
-
Application
-
Operating System
-
Firmware
-
Instruction Set Architecture:
-
Organization of Programmable Storage
-
Data type and Structures: encodings and machine representation
-
Instruction set
-
Instruction Formats
-
Addressing Modes and Accessing Data and Instructions
-
Exception Handling
-
Hardware:
-
Instruction Set Processing
-
I/O System
-
Digital Design
-
Circuit Design
-
Layout
-
Basic Components of a Computer
-
Processor: Datapath and Control
-
Memory
-
I/O
-
Example: implementing A=B+C (from instructions to gates)
-
MIPS instructions
-
Arithmetic
-
Memory organization and load and store
-
Instruction types and formats
-
R-type
-
I-type for arithmetic
-
I-type for lw and sw
-
I-type for branch
-
Stored program concept: programs in memory, fetch&execute cycle
-
Von Neumann Architecture
-
CPU, Memory System, I/O system
-
Stored program concept: programs in memory, fetch&execute cycle
-
Instructions are executed sequentially
-
Turing machines
-
Non-Von Neumann Architecture:
CS502 - Week 2
Computer arithmetic and ALU design
Reading: Patterson & Hennessy - Sections 3.1 -
3.4, 3.6 (without addition and multiplication), B.5 (CD).
Lecture Slides (PDF),
ALU diagram (PDF)
Lecture Notes:
-
Representing numbers: sign bit, one's complement, two's complement.
-
Arithmetic: addition, subtraction, detecting overflow.
-
Building ALU - hierarchical approach
-
Floating point numbers
-
Scientific notation: (-1)^sign * significand * 2^exponent
-
Range and precision (overflow and underflow).
-
IEEE 754 floating point standard - allows integer comparison:
-
normalized representation
-
implicit leading 1
-
exponent is biased: exponent in [0..0 (most negative), 1..1 (most positive)]
-
bits of the significand represent the fraction between 0 and 1.
-
(-1)^S * (1 + s1*2^-1 + s2*2^-2 + ...) * 2^(exponent-bias)
-
Problems with floating point arithmetic
Tutorials and practice quizzes:
-
Two’s complement numbers:
-
Floating point:
CS502 - Week 3
CPU datapath and control
Reading: Patterson & Hennessy - Sections 5.1 - 5.5 (without
"Defining the Control").
Lecture Slides: PDF
Lecture Notes:
I. Building a Datapath
-
Abstract level implementation:
-
Instruction memory
-
Program counter
-
Register file
-
ALU
-
Data memory
-
Basic building elements
-
Combinational logic (e.g. ALU)
-
State elements (registers, memory)
-
Clocking methodology: edge triggered
-
Fetching instructions and incrementing the program counter
-
Register file and execution of R-type instructions
-
Datapath for lw and sw instructions (add data memory and sign extend)
-
Datapath for branch instructions
Demo: http://www.it.jcu.edu.au/Subjects/cp2005/resources/animation/wk6animations.ppt
II. Control
-
ALU control: mapping the opcode and function bits to the ALU control inputs
-
Designing the main control unit
-
Operation of the Datapath (single-cycle implementation):
-
R-type instructions
-
Load (store) word
-
Branching instructions
-
Problems of the single-cycle implementation
CS502 - Week 4
Pipelining
Reading: Patterson & Hennessy - Sections 6.1, 6.2, 6.4
- 6.6 (without implementation details).
Lecture Slides: PDF
Lecture notes:
I. Introduction to Pipelining
-
Pipelining by analogy (laundry example):
-
Pipelining helps throughput of the entire workload
-
Multiple tasks operating simultaneously and using different resources
-
Potential speedup = number of pipe stages
-
The pipeline rate is limited by the slowest stage
-
Unbalanced lengths of pipe stages reduces the speedup
-
The time to "fill" the pipeline and the time "drain" it reduces the speedup
-
Stall for dependencies
-
Five stages of the load MIPS instruction
-
Insturction fetch.
-
Instruction decode.
-
Execute (ALU operation).
-
Memory access.
-
Write back (register write).
-
The pipelined datapath
-
Single cycle, multiple cycle vs. pipeline
-
Advantages of pipelined execution
II. Problems with pipelining (pipeline hazards)
-
Structural hazards: single memory
-
Control hazards:
-
Stall: wait until decision is clear
-
Predict: fixed prediction (e.g. fail), dynamic prediction (based on history)
-
Delayed brach (software solution):
add $4, $5, $6
beq $1, $2, $40
beq $1, $2, 40 ==> add
$4, $5, $6
lw $3, 300($0)
lw $3, 300($0)
-
Data hazards (dependecies backwards in time):
-
Forwarding (bypassing - hardware solution)
-
Reordering code (software solution)
lw $t0, 0($t1)
lw $t0, 0($t1)
lw $t2, 4($t1) ==>
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
Exercises: 6.7, 6.8, 6.15 (For More Practice, on CD)
Demo: http://www.web-ee.com/primers/files/MIPS/MIPS.htm
CS502 - Week 5
Memory hierarchies
Reading: Patterson & Hennessy - Sections 7.1 - 7.4.
Lecture Slides: PDF
Lecture notes:
-
Memory technologies and trends
-
Impact on performance
-
The need of hierarchical memory organization
-
The principle of locality
-
Memory hierarchy terminology
-
The Basics of caches
-
Direct-mapped cache: Accessing a cache (Cash index,Tag,Valid bit)
-
Writing to the cache (write-through and write-back schemes)
-
Improving cache performance by flexible placement of blocks in the cache
(slides
in PS, slides
in PDF)
-
Direct mapped: Cache index = (Block address) modulo (Cache size); no search;
small tag.
-
Set associative: Cache index = (Block address) modulo (Number of sets in
cache); search the set; larger tag.
-
Fully associative: Cache index is not determined; search the whole cache;
tag = address.
-
Choosing which block to replace: least recently used
-
The need of virtual memory
-
Many programs (processes) can use a single memory
-
Use a memory exceeding the size of the main memory
-
VM organization and terminology: virtual address, physical address, page,
page offset, page fault, memory mapping (translation).
-
Addressing pages:
-
Page table, page table register
-
Processes (active, inactive) and page tables
-
Page faults
-
Replacing pages: LRU, reference (use) bit
-
Write-back scheme (dirty bit)
CS502 - Week 6
Interfacing peripherals and multiprocessors
Reading: Patterson & Hennessy - Chapters 8, 9 (CD).
Lecture notes:
-
Interfacing Processors and Peripherals - Buses (slides
in PDF)
-
Buses: lines, transactions, types
-
Synchronous and asynchronous buses
-
Handshaking protocol
-
Bus access: master and slave
-
Bus arbitration schemes
-
Bus standards
-
Interfacing I/O devices to Memory, CPU and OS (slides
in PDF)
-
The role of the operating system in interfacing I/O devices to Memory
-
Controlling the I/O devices
-
Memory mapped I/O
-
Special I/O instructions
-
Communicating with the processor
-
Polling
-
Interrupt-driven I/O
-
Direct memory access (DMA)
-
Multiprocessors (slides
in PDF)
-
Basic approaches to sharing data and types of connectivity
-
Programming multiprocessors
-
Multiprocessors connected by a single bus
-
A parallel program
-
Multiprocessor cache coherency
-
Implementing a multiprocessor cache coherency protocol
-
Synchronization using coherency
-
Networks of muiltiprocessors (slides
in PDF)
-
Shared memory vs. multiple private memories
-
Centralized memory vs. distributed memory
-
Parallel programming by message passing
-
Distributed memory communication
-
Memory allocation
-
Clusters and network topology
-
Modern clusters
CS502 - Week 7
Fundamentals of distributed systems
Additional Reading: Andrew S. Tanenbaum, Maarten van Steen,
Distributed Systems: Principles and Paradigms.
Lecture Notes (in PDF):
-
Categories of computer systems
-
Conventional sequential machines (mainframes, minicomputers + network
of terminals) - multitasking, multiusers.
-
Conventional systems with special purpose components (specialized processors)
- single specialized task.
-
Multiprocessor systems - single task allowing parallel computation.
-
Distributed systems (computers connected by a network) - different task,
shared data.
-
Conventional systems with special purpose components
-
A special purpose unit (e.g. math processor) attached to the main bus
-
Back-end system (additional separate machine, e.g. graphic terminal)
-
Example: iDBP
-
file operations: positioning and manipulating a cursor in a file
-
used to implement relational database systems
-
add-on board or back-end system
-
Multiprocessor systems:
-
Multiple processors
-
Shared memory (single address space) vs. multiple private memories
-
Centralized memory vs. distributed memory
-
Categories of parallelism:
-
Single instruction stream, single data stream (SISD)
-
Single instruction stream, multiple data streams (SIMD)
-
Multiple instruction streams, single data stream (MISD)
-
Multiple instruction streams, multiple data streams (MIMD)
-
Distributed computer systems
-
Issues:
-
data location and security
-
load distribution
-
process migration
-
fault tolerance
-
Types:
-
homogeneous systems
-
heterogeneous systems
-
Distributed file systems:
-
Fault-tolerant networks:
-
redundancy (static-masking, dynamic)
-
consistency (strong, weak)
-
Firewalls and self-stabilizing networks
CS502 - Week 8
Information Theory
-
Fundamentals of information theory (information vs. data)
-
Probability (see http://mathworld.wolfram.com/Probability.html)
-
Frequency approach - experiments
-
Objective approach - properties real world entities
-
Subjective approach - the uncertainty of the agent (agent's knowledge)
-
Example: the probability that the sun will rise tomorrow - undefined, 1,
1-e, Laplas estimate: (d+1)/(d+2); Laplas for k-outcomes: (n+1)/(N+k)
-
Joint probability: P(A.B)=P(A)*P(B) (if A
and B are independent)
-
Conditional probability: P(A.B)=P(A|B)*P(B)
-
Bayes Theorem (more in PDF):
P(A|B)=P(B|A)*P(A)/P(B)
-
Interpretation of Bayes: A - cause; B - effect; posterior=likelihood*prior/evidence
-
Law of total probability: A={A1,A2,...An},
exhaustive, mutually exclusive (P(Ai.Aj)=0;i\==j); P(B)=SUM
P(B|Ai)P(Ai)
-
Claude Shannon (see http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html)
-
Measuring information
-
Settings: source, message, recipient
-
Recipient's point of view: a measure of uncertainty, expected surprise
of getting the message
-
Basic formula: I(M)=-P(M) or I(M)=-log2P(M)
(minimal number of bits to encode the message)
-
Logarithmic scale: additivity: I(M1)+I(M2)=-logP(M1)-logP(M2)=-log(P(M1)P(M2))=I(M1
and M2)
-
Log 2 scale: reducing the uncertainty (halving strategy, guess a number
in [0,7]), measuring unit: bit
-
Entropy (average information): uncertainty of the recipient, the expected
surprise of getting an arbitrary message from a set of possible messages.
-
H=-P1.logP1-P2.logP2-...-Pn.logPn
-
P1+P2+...+Pn=1
-
Example1: H({A,~A})=-P(A).logP(A)-P(~A).logP(~A). Maximum
for P(A)=1/2, minimum for P(A)=1
-
Example2: X={0,1,...,7}, P(X=1)=P(X=2)=...=p(X=7)=1/8;
H(X)=-8*(1/8)*log(1/8)=3 (bits)
-
Sampling
Theorem
-
Representing signals: frequency, time, amplitude
-
Baseband signals: F(t) limited in frequency
in the interval [-W,W]
-
Broadband signals (carrier based): moved in the interval [Wc-W,Wc+W],
e.g. F(t)=Fs(t)sin(Wct)
-
Representing signal by sample points (theoretically, infinite number of
points required)
-
locating a point in space: coordinate systems
-
locating a function F(t) : vector
(F(t1),F(t2),...,F(tn)),
tn - sample points
-
orthonormal sets of functions: Integral(a,b)
Gn(x)Gm(x)dx
= {1 for n=m; 0 for n\=m}
-
expansion of a function (Fourier series): F(x)=Sum(n=-inf,
n=+inf)CnGn(x),
Cm=Integral(-inf,+inf) F(x)Gm(x)dx
-
Sampling theorem (see also http://mathworld.wolfram.com/SamplingTheorem.html)
-
{Sincn(t)}
in [-T,T], T-> +inf; Sincn(t)=sin(2piW(t-n/(2W))/2piW(t-n/(2W)
-
F(t)=Sum(n=-inf, n=+inf) Cn
sin(2piW(t-n/(2W))/2piW(t-n/(2W)
-
F(tn)=Cn
(all terms above are 0 except one, sin(0)/0=1)
-
F(t)=Sum(n=-inf, n=+inf) F(tn)sin(2piW(t-tn)/2piW(t-tn),
tn=n/(2W) (sampling
theorem)
-
constraints: F(t) is limited in frequency
in [-W,W], infinite time duration (infinite
number of sample points)
-
AD and DA conversion: quantization (thresholds), amplifiers vs. repeaters,
generally very low error rates (regenerating digital values at relay points)
-
Error detection and correction
-
Binary case: {000,111} for {0,1} - one error correcting code
-
Arbitrary waveform: three instead of one sample point - widening the bandwidth
(2TW dimensions)
-
Demonstration
Applets from Information Technology: Inside and Outside by David Cyganski,
John A. Orr with Richard F. Vaz.
CS502 - Week 9
Switching
-
The importance of switching in communication - the cost of switching is
high
-
Definition: transfer input sample points to the correct output ports
at the correct time
-
Terminology
-
Switching
-
Digital switching (sample points amplitudes are 0's and 1's)
-
PABX
-
Circuit
-
Circuit switching
-
Packet switching
-
Voice digitization: W=3KHz, sampling at 2*3=6 or 8KHz, 256 levels for quantization
(8 bits), Bit rate=64Kb/s
-
Telephone switching
-
Time division multiplexing: time slot (0.1 ms), field, frame; 125ms/0.8=150
channels + time for synchronization and control
-
Switch architecture
-
Sampling input signals, storing values in memory, placing values in the
proper field and frame of the output sequence
-
Need for more channels: hierarchical switching
-
Combining time and space switching
-
General framework for switching
-
time, space and frequency (broadband signals) switching
-
synchronization (single clock) and buffering (memory)
-
set-up time and delay (propagation time)
-
"call duration" assignment vs. dynamic assignment
-
in-band and out-of-band signaling
-
Circuit (synchronous) vs. packet (asynchronous) switching: control and
routing overhead, virtual packet switching
-
Switching techniques and networking: switching is the technology allowing
to get a message between the nodes of a network
-
Crossbar switching: mechanical (in the past) or electronic.
-
Bus and cable switches: computer buses or cables (switching + transportation
= network)
-
Token passing approach (similar to the locks used by multiprocessors connected
by a bus)
-
Ethernet approach: cable or ring, packets, conflicts, resending
-
Synchronization and Hub switch: star networks (no conflicts)
Lecture
slides in PPT
CS502 - Week 10
Database management concepts
-
Database Management Systems (DBMS)
-
Managing large quantity of structured data
-
Efficient retrieval and modification: query processing and optimization
-
Sharing data: multiple users use and manipulate data
-
Controlling the access to data: maintaining the data integrity
-
An example of a database (relational): relations (tables), attributes (columns),
tuples (rows). Example query: Salesperson='Mary' AND Price>100.
-
Database schema (e.g. relational): names and types of attributes, addresses,
indexing, statistics, authorization rules to access data etc.
-
Data independence: separation of the physical and logical data (particularly
important for distributed systems). The mapping between them is provided
by the schema.
-
Architecture of a DBMS - three levels: external, conceptual and internal
schema
-
Types of DBMS
-
The data structures supported: tables (relational), trees, networks, objects
-
Type of service provided: high level query language, programming primitives
etc.
-
Basic DBMS types
-
Linear files
-
Sequence of records with a fixed format usually stored on a single file
-
Limitation: single file
-
Example query: Salesperson='Mary' AND Price>100
-
Hierarchical structure
-
Trees of records: one-to-many relationships
-
Limitations. ITEM-SUPPLIER: many-to-many relationship (requires duplicating
records); problems when updated
-
Retrieval requires knowing the structure (limited data independence): traversing
the tree from top to bottom using a procedural language
-
Network structure
-
Similar to the hierarchical database with the implementation of many-to-many
relationships
-
Record and set types
-
Relational structure
-
Relations, attributes, tuples
-
Primary key (unique combination of attributes for each tuple)
-
Foreign keys: relationships between tuples (many-to-many). Example: SUPPLIES
defines relations between ITEM and SUPPLIER tuples.
-
Advantages: many-to-many relationships, high level declarative query
language (e.g. SQL)
-
SQL example (retrieve all items supplied by a supplier located in Troy):
SELECT ItemName
FROM ITEM, SUPPLIES, SUPPLIER
WHERE SUPPLIER.City = "Troy" AND
SUPPLIER.Supplier# = SUPPLIES.Supplier#
AND
SUPPLIES.Item# = ITEM.Item#
-
Programming language interfaces: including SQL queries in the code
-
Object-Oriented structure
-
Objects (collection of data items and procedures) and interactions between
them.
-
Is this really a new paradigm, or a special case of network structure?
-
Separate implementation vs. implementation on top of a RDBMS
-
Retrieving and manipulating data: query processing
-
Parsing and validating a query: data dictionary - a relation listing
all relations and relations listing the attributes
-
Plans for computing the query: list of possible way to execute the
query, estimated cost for each. Example:
SELECT ItemNames, Price
FROM ITEM, SALES
WHERE SALES.Item# = ITEM.Item# AND Salesperson="Mary"
-
Index: B-tree index, drawbacks - additional space, updating; indexing
not all relations (e.g. the keys only)
-
Estimating the cost for computing a query: size of the relation, existence/size
of the idices.Example: estimating Attribute=value with a given
number of tuples and the size of the index.
-
Query optimization: finding the best plan (minimizing the computational
cost and the size of the intermediate results), subsets of tuples, projection
and join.
-
Static and dynamic optimization
-
Database views: creating user defined subsets of the database, improving
the user interface. Example:
CREATE VIEW MarySales(ItemName,Price)
AS SELECT ItemName, Price
FROM ITEM, SALES
WHERE ITEM.Item#=SALES.Item# AND Salesperson="Mary"
Then the query:
SELECT ItemName
FROM MarySales
WHERE Proce>100
translates to:
SELECT ItemName
FROM ITEM, SALES
WHERE ITEM.Item#=SALES.Item# AND Salesperson="Mary" AND Price>100
-
Data integrity
-
Integrity constraints: semantic conditions on the data
-
Individual constraints on data items
-
Uniqueness of the primary keys
-
Dependencies between relations
-
Concurrency control
-
Steps in executing a query.
-
Concurrent users of the database, interfering the execution of one query
by another
-
Transaction: a set of operations that takes the database from one
consistent state to another
-
Solving the concurrency control problem: making transactions atomic
operations (one at a time)
-
Concurrent transactions: serializability theory (two-phase locking), read
lock (many), write lock (one).
-
Seriazible transactions: first phase - accumulating locks, second phase
- releasing locks.
-
Deadlocks: deadlock detection algorithms.
-
Distributed execution problems: release a lock at one node (all locks accumulated
at the other node?); strict two-phase locking
-
Backup and recovery
-
The problem of keeping a transaction atomic: successful or failed (What
if some of the intermediate steps failed)
-
Log of database activity: use the log to undo a failed transaction.
-
More problems: when to write the log, failure of the recovery system executing
the log.
-
Security and access control
-
Access rules for relations or attributes. Stored in a special relation
(part of the data dictionary).
-
Content-independent and content-dependent access control
-
Content-dependent control: access to a view only or query modification(e.g.
and-ing a predicate to the WHERE clause)
-
Discretionary and mandatory access control
-
Client-Server architectures
-
Knowledge Bases and KBS (and area of AI)
-
Information, Data, Knowledge (data in a form that allows reasoning)
-
Basic components of a KBS
-
Knowledge base
-
Inference (reasoning) mechanism (e.g. forward/backward chaining)
-
Explanation mechanism/Interface
-
Rule-based systems (medical diagnostics, credit evaluation etc.)
Lecture
slides in PPT
CS502 - Week 11
Distributed memory and data/informition retrieval
-
The need of memory hierarchy
-
Massive data storage requirements (e.g. a satellite orbiting Earth generates
1 terabyte data per day)
-
Difference in the cost and capacity of different types of memory
-
Data location
-
Data location factors
-
Cost
-
Performance
-
Reliability
-
Application software
-
Directory - a mechanism to locate a data object
-
Directory location: may be different from the data location
-
Directory server: a separate system, not a part of the application program
-
Directory structure: usually hierarchical
-
Directory search: finding out the data location and making use of this
fact (e.g. designing an application)
-
Providing access control
-
Directory systems
-
Pure approaches
-
Master control directory: a single directory containing all the information
-
Directory server: a single computer providing all directory functions (contains
the master directory)
-
Fully replicated directories: copies of the directory at different locations
(concurrency control required)
-
Local directories: contain information on local data and usually stored
at the same location
-
Point of control directories: located at the point where the access control
is exercised (e.g. hard disk)
-
Problems with scaling: hierarchical system of directories (different
directories at different levels)
-
Important issues
-
Directory backup
-
Coherency and consistency
-
Local caching
-
Directories and access control (different functions). Points of access
control
-
Application program (problems with using data provided by other applications)
-
DBMS (the most common approach)
-
Control at the physical location of data (e.g. library)
-
Function of the communication networks (e.g. preventing a user from accessing
another node, telephone system)
-
Information retrieval (Slides
in PPT)
-
Basics
-
Finding relevant data using irrelevant keys
-
Example: database of photographic images sorted by number, date.
-
DBMS: Well structured data according to the information content
-
Text document retrieval (not well structured data)
-
Document retrieval queries
-
NLP: requires structured data to match the translated query
-
Keyword search: boolean, weighted
-
Abstracting documents
-
Evaluating retrieval quality
-
Precision: retrieved and relevant /retrieved
-
Recall: retrieved and relevant /relevant
-
Access methods
-
Full text scanning: hardware support
-
Keyword or phrase indexing: storage overhead (average 30%-70%), index updating
-
Document signature: a fixed length bit string for each document (precision
and recall easily computed)
-
Information Retrieval Resources and Demos: http://ir.exp.sis.pitt.edu/resources/
-
Web document retrieval
-
Crawling the Web
-
Analyzing
the Web structure
-
Social Networks
-
Page Rank
-
Authorities and Hubs
-
Web document retrieval books and atricles
-
Data
Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage,
Chapters 1, 2
-
Mining the
Web: Discovering Knowledge from Hypertext Data, Chapters 7, 8.
-
The PageRank Citation
Ranking: Bringing Order to the Web by Lawrence Page, Sergey Brin, Rajeev
Motwani, and Terry Winograd. Stanford Digital Library Technology Project,
1998
-
Authoritative
Sources in a Hyperlinked Environment Jon M. Kleinberg. Journal of
the ACM 46:5 pp. 604-632, 1999
-
Document classification and clustering:
-
Document attributes: semantic (NLP, domain knowledge), statistical (keyword
frequencies, vector space model, TFIDF), mixed
-
Similarity measures: cosine similarity, distance, information compression
metrics
-
The Machine Learning approach
-
Bayesian text classification (reading: Tom
Mitchell, Machine Learning, McGraw Hill, 1997)
-
Clustering algorithms (reading: A
Comparison of Document Clustering Techniques)
-
Flat clustering: cluster and centroid (typical cluster document), k-means
algorithm
-
Hierarchical clustering (example Dendrogram)
-
Web Mining
-
Web content mining - discovery of Web document content patterns (text mining).
-
Web structure mining - discovery of hypertext/linking structure patterns
-
use hyperlinks to enhance text classification
-
page ranking
-
modeling and measuring the Web
-
Web usage mining - discovery of web users activity patterns
-
mining web server logs
-
mining client machine access logs
-
Web Mining books
CS502 - Week 12
The World Wide Web
-
A little history
-
1989, CERN: distribution of linked documents (nuclear physics)
-
1991, text-based prototype
-
1993, First graphical interface - Mosaic
-
1994: WWW consortium (CERN, MIT):
http://www.w3.org
-
The client side
-
WEB documents (pages) connected via hyperlinks (hypertext).
-
Hyperlinks: highlighted strings in text or images.
-
Browsers (text-based or graphical): tools for navigating the WEB.
-
Forms and Java applets
-
The server side
-
URL (Uniform Resource Locator): <protocol name>://<machine name>/<file
name>, e.g. http://www.w3.org/History.html
-
Steps of fetching http://www.w3.org/History.html
-
The browser determines the URL
-
The browser asks the local DNS (Domain Name System) server for the IP (Internet
Protocol) address
-
DNS replies with 18.23.0.23.
-
The browser makes a TCP (Transmission Control Protocol) connection to port
80 on 18.23.0.23
-
It then sends a GET /hypertext/WWW/History.html
-
The www.w3.org server send the file History.html
-
The TCP connection is released
-
The browser displays all the text in History.html
-
The browser displays all the images in History.html (new TCP connection
for each image)
-
Example of HTTP protocol in text:
C: telnet www.w3.org 80
T: Trying 18.23.0.23 ...
T: Connected to www.w3.org
T: Escape character is '^]'
C: GET /hypertext/WWW/TheProject.html HTTP/1.0
HTTP/1.0 200 Document follows
MIME-Version: 1.0
Server: CERN/3.0
Content-Type: text/html
Content-Length: 8247
<HEAD><TITLE> The World Wide Web Consortium
(W3C)</TITLE><HEAD>
<BODY>
<H1><IMG ALIGN=MIDDLE ALT="W3C" SRC="Icons/WWW/w3c_main.gif">
The World Wide Web Consortium </H><P>
The World Wide Web is the universe of network-accessible
information.
...
</BODY>
-
Using other protocols
-
HTTP Browser <---> FTP server
-
HTTP Browser <---> FTP Proxy server <---> FTP server
-
Proxy servers: translating protocols, caching pages, filtering information
-
HyperText Transfer Protocol (HTTP)
-
Simple (GET without the protocol version) and full requests
-
Methods (commands)
-
GET: request to read a Web page encoded in MIME (Multipurpose Internet
Mail Extensions - adding a header to describe the encoding)
-
HEAD: request to read a Web page header
-
PUT: request to store a Web page (may include authentication headers)
-
POST: request to append new data to a Web page (e.g. posting a message
to a news group)
-
DELETE: request to delete a Web page (may include authentication headers)
-
LINK: Connects two existing pages
-
UNLINK: breaks an existing connection between two pages
-
Writing WB page in HTML
-
URL (Universal Resource Locator): a mechanism for naming and locating Web
pages
-
<Protocol>://<DNS name of the host>/<File name (with possible
shortcuts, e.g. ~user)>
-
Protocols:
-
The HTML language
-
Standardized markup language: how the documents are to be formatted and
reformatted (e.g. LaTeX), contrasted to WYSIWYG (not standardized, does
not allow reformatting)
-
Tags with parameters, e.g. <IMG SRC ="http://www.widget.com/image.gif"
ALT="AWI Logo">
-
Hyperlinks:
-
<A HREF="http://www.nasa.gov"> NASA's home page </A>
-
<A HREF="http://www.nasa.gov"> <IMG SRC="shuttle.gif" ALT="NASA">
</A>
-
Forms (HTML 2.0): two-way trafic between the page owner and page user,
<INPUT> tag. Example: Look-Up
Classes
-
CGI (Common Gateway Interface): a standard for handling forms' data. Example:
interfacing a database:
-
Write a script (program) to interface between a database and the Web
-
Store the script in the cgi-bin directory under an URL.
-
When retrieves a page located in cgi-bin the HTTP server executes
the script.
-
Put the script name in the ACTION parameter of the form.
-
The browser invokes the operation specified in METHOD (usually POST)
-
The script is started and presented with the form input data.
-
After the database retrieval the script produce an HTML file, which is
sent back to the browser.
-
This mechanism can be used to include selected ads in the Web page depending
on what the user is looking for.
-
Java
-
Applets
-
Implementing highly interactive Web pages
-
Running applications and using remote servers or databases
-
Adding animation and sound to the Web page without spawning and external
viewer
-
No need of standard for the Web protocols (HTTP, FTP etc.)
-
The Java system for the Web
-
A Java-to-bytecode compiler
-
A browser that understands applets (<APPLET> tag)
-
A bytecode interpreter
-
Security issues
-
Problem: running a program on the client's machine (possible crash, collecting
private information, consuming resources)
-
Solutions:
-
First barrier: strong typed language - array bounds checking, no pointers
(thus no memory access)
-
Second barrier: bytecode verifier
-
Third barrier: class loader (loading first system classes before looking
for user classes)
-
Fourth barrier: security measures built in the classes, e.g. file access
class (asking user if the applet needs file access).
-
More problems:
-
asking to access the /tmp directory
-
requiring network access to work
-
covert channels
-
Locating information on the Web
-
The
Web Challenges (How to turn the web data into web knowledge)
-
Web Search Engines
-
Topic Directories
-
Semantic Web
-
Crawling the Web
-
Information Retrieval and Web Search
-
Web Agents
-
Is
There an Intelligent Agent in Your Future?
-
A good internet agent needs to be:
-
Communicative: able to understand your goals, preferences and constraints.
-
Capable: able to take options rather than simply provide advice.
-
Autonomous: able to act without the user being in control the whole time.
-
Adaptive: able to learn from experience about both its tasks and about
its users preferences.
-
Agent research sites:
-
Semantic Web
MIPS Programming assignment
Due date: February 19
Write a program in MIPS assembler to perform some simple computation (e.g.
average of three integers, converting miles into kilometers, Fahrenheit
into Celsius etc.). The program must include:
-
At least one instruction from each instruction type: R-type arithmetic,
I-type arithmetic and Memory transfer.
-
Input and output through system calls.
-
Comments explaining the format and the meaning of each instruction.
Use the SPIM simulator
to debug and run the program. You may find additional information about
MIPS programming using SPIM at CS
254 - Computer organization and assembly language programming and Introduction
to RISC Assembly Language Programming (see example programs).
Documentation and submission: Submit the source text of
the program as a file attachment through CCSU Pipeline > Student > Blackboard
Vista Courses > CS-502-70 > Programming Assignment.
Mid-term test
The Midterm test is available from Vista. There are 20 multiple choice
or short answer questions that have to be answered within 3 hours.
The test includes the following topics:
-
Number systems (binary, hexadecimal, two's complements, floating
point)
-
MIPS instruction set architecture
-
Single cycle datapath and control
-
Basic principles of the multi-cycle datapath operation (slide 24 of Week
3)
-
Identifying dependencies, and solving pipeline hazards
-
Memory hierarchy (principles of caches)
Building a 16-bit ALU
Due date: March 4
Design a 16-bit ALU unit that performs 5 functions on two 16-bit data inputs
(a and b) according to the following table:
| ALU operation bits |
Operation |
Result |
| 000 |
Logical AND |
a & b |
| 001 |
Logical OR |
a | b |
| 010 |
Add |
a + b |
| 110 |
Subtract |
a - b |
| 011 |
Rotate left one bit |
Result0 = a15; Result(i+1) = a(i), i=0,...,14 |
Along with the Result output (delivering the result of the operation)
the ALU must have the following additional outputs:
-
One-bit Zero output which is set to 1 when all bits of the
Result are 0.
-
One-bit Overflow output which is set to 1 when an arithmetic overflow
occurs. Note that the non-arithmetic operations should not set the overflow
bit. Use a box for the overflow block diagram showing only the inputs and
the out and describe separately its logic by a truth table.
Use hierarchical design and draw the logic diagram of the 1-bit ALU and
then use block level diagram to draw the whole 16-bit ALU. Include
also a short text description of your design.
Documentation and submission: Write a report containing
descriptive
text, the ALU function code table, the overflow truth table, the 1-bit
ALU diagram and the 16-bit ALU diagram. Use MS Word, PDF or HTML format.
Submit your report as a file attachment through CCSU Pipeline > Student
> Blackboard Vista > CS-502 > ALU Assignment. The maximum grade for this
assignment is 10 points (1/3 of the Midterm Project grade).
Mid-term project (30 ponits): Designing
a mini MIPS machine
Due date: March 27
Description: Design a simplified version of the MIPS machine - single
cycle implementation. Use the components described in the lecture slides
and build the machine hierarchically, i.e. draw and describe simpler elements
and then connect them into a more complex diagram. The top level diagram
should describe the datapath and control (including all control signals).
Include a table describing the functionality of the control unit. Do the
drawing by a computer drawing program (e.g. Paint). You can also use and
edit diagrams from the lecture slides.
The machine should have the following
components (described as blocks with proper inputs and outputs):
-
General purpose registers (register file): 4 registers, 16-bit long,
numbered 0 - 3.
-
Other registers: Program counter (16 bits).
-
Instruction Memory: Byte addressing, 16-bit word and 16-bit address.
-
Data Memory: Byte addressing, 16-bit word and 16-bit address.
-
ALU: 16 bit (include the assignment for week #5 described here).
-
Main control unit (include a truth table).
-
Other components necessary to connect the main components: sign
extend, shift unit, multiplexes, gates (for the branch logic).
Instruction set:
| Instruction |
Opcode |
| add |
0000 |
| sub |
0001 |
| and |
0010 |
| or |
0011 |
| rol (rotate left one bit) |
0100 |
| addi |
0101 |
| lw |
0110 |
| sw |
0111 |
| beq (branch on equal) |
1000 |
| bne (branch on not equal) |
1001 |
For the meaning of the instructions see the Patterson & Hennessy's
text.
Instruction formats:
R-format (add, sub, and, or, rol)
| op |
rs |
rt |
rd |
unused |
| 4 |
2 |
2 |
2 |
6 |
I-format (lw, sw, beq, bne, addi)
| op |
rs |
rt |
address/offset/value |
| 4 |
2 |
2 |
8 |
The project report should include the following:
-
Student name and project title
-
A descriptive text outlining the design of the machine (instruction set
architecture, main components);
-
A description of the 16-bit ALU including a descriptive text and diagrams
(include your report for the ALU assignment or a reference to it);
-
A diagram showing the datapath and control including all data and control
lines marked with their width (number of bits);
-
A table describing the input/output behavior of the main control unit.
Submission: Write a report with your name on it containing
descriptive text and all diagrams and tables. Use MS Word, PDF or HTML
format. Submit your report as a file attachment through CCSU Pipeline >
Student > Blackboard Vista > CS-502 > CS-502 >MIPS Machine. The maximum
grade for this projects is 30 points, which includes the maximum grade
for the ALU assignment (10 points).
Extra credit (maximum 10% of the project grade): Implementing
additional MIPS instructions, which require changes both in the diagram
and in the control table.
Term paper (40 points)
Posted: 04/07/2008
Due: 05/15/2008
Write a paper that analyzes a particular distributed computing environment.
Examples of such environments are: a company computer and/or telephone
network, a university campus network, hospital computer network, library
network, ticket reservation system etc. Choose an environment that you
know best and that is rich enough so that you can cover all of the
following issues (including the sub items):
-
General description (describe the environment in general terms, i.e. the
name/type and location of the network, who are the users and for what purpose
it is used, what is the general structure etc.)
-
General requirements for the distributed system architecture to be used
(in terms of cost, efficiency, security etc.).
-
Description of the distributed system architecture currently being used
-
Sources and nature of information and data (voice, image, computer data
files etc.)
-
Basic characteristic of the data (location/distribution, media, volume,
access/security)
-
Database and/or information systems
-
Computing resources: computers (conventional, specialized, multiprocessors)
and other information processing equipment
-
Switching and network technology
-
Signals and communication channels (analog/digital, bandwidth, AD/DA conversion)
-
Problems with the current distributed system architecture (requirements
that are not met, inadequate resources or technology)
-
Suggestions for improvement of the existing system architecture or brief
description of a new architecture to meet the requirement
Format of the term paper
Write the paper following the general format of a scientific paper:
title, author, abstract, introduction (short description of the subject,
goals, structure of the paper), main text (structured into sections
and covering the above 5 issues), conclusion and references. Include all
materials you use (including WEB resources) in the reference section and
cite them properly in the text. Include as much as possible illustrative
material (graphs, charts, diagrams, pictures). Type and print the paper
using a word processing system (such as Microsoft Word). An example of
a term paper can be found here.
Policy on copying
The term paper must be original, i.e. entirely written by yourself.
Since the paper describes an existing system obviously you have to use
other materials (books, reference materials, web sources, personal communication).
Every time you use such a material you have to include a citation and put
the title and authors of this material in the reference section. Then you
can rephrase or even copy a part of the material (provided that you put
it in quotes). If you describe general ideas and approaches you have also
to cite their authors. If the latter is yourself, then use something
like "the author believes that .." or " we suggest ...". Papers that
include copied paragraphs and phrases without acknowledgment to their authors
will get a grade of F. For more information see the CCSU
Graduate Policy on Academic Misconduct.
Term paper grading
The term paper grade will be based on the paper format (according
to the above mentioned structure), comprehensibility and presentation
(self-contained
explanations of the major issues with minimal reference to technical terms
and brand names outside the area) and completeness
of the
material (i.e. all 5 issues mentioned above including the subitems if item
3 should be discussed in sufficient detail).
Submission
Submit the paper as an attachment through Vista/Term paper.
Extra credit
An extra credit of maximum 10% will be given for the preparation of presentation
slides (e.g. in PowerPoint format). Prepare no more than 10 slides, that
outline your paper. Slides should omit the details and discuss the most
important issues including the problems and suggestions for improvement
of the presented distributed environment.
Quiz (max grade 5 pts.)
The quiz is available in Vista. There are 20 multiple
choice or short answer questions that will have to be answered within 1
hour.
Review topics
-
Categories of computer systems:
-
Conventional sequential computers
-
Conventional systems with special purpose components
-
Multiprocessor systems
-
Distributed systems
-
Distributed systems:
-
Issues: data location and security, load distribution, process migration,
fault tolerance
-
Types: homogeneous systems, heterogeneous systems
-
Information Theory
-
Probability: Bayes's theorem
-
Measuring information: basic formula (logarithmic scale, halving strategy),
entropy (average information)
-
Signals:
-
Representation: time, amplitude, frequency
-
Transmission: sampling (Sampling Theorem), digitizing (AD and DA conversion),
bandwidth
-
Switching:
-
Circuit switching
-
Packet switching
-
Time division multiplexing
-
Ethernet approach: packets, conflicts and resending
-
DBMS:
-
Application area: large quantity of structured data
-
Types: tables (relational), trees, networks, objects
-
Data independence: database schema
-
Relational DBMS
-
Basics of SQL
-
Query processing: plans, optimization
-
Database views
-
Data integrity
-
Integrity constraints
-
Concurrency control
-
Backup and recovery
-
Security and access control
-
Directory systems: purpose, types
-
Information retrieval
-
Application area: finding relevant data using irrelevant keys (not well
structured data)
-
Text document retrieval: retrieval queries, retrieval quality
-
Document classification and clustering: document representation (vector
space model), similarity measures.