CS502 - Computing and Communication Technology
Prerequisites: Must be enrolled in graduate level
Description: The course offers comprehensive coverage of the basic
concepts of Computing and Data Science. The first part of the course is
devoted to Computer Organization and Design. This part discusses the main
components of computers and the basic principles of their operation. It
demonstrates the relationship between software and hardware and focuses
on the foundational concepts that are the basis for current computer
design. The second part of the course discusses the fundamentals of Data
Science and some related areas. The main emphasis here is on data
distribution, as the way data are generated, stored and used is naturally
distributed. This part discusses various levels of transmitting, storing
and using information, data and knowledge and surveys important areas such
as Information Theory, Switching, Databases, Data classification,
Distributed memory and data retrieval, World Wide Web and Data/Web Mining.
Course objectives: Upon successful completion of the course the
student will be able to
Understand the basics of the Von Neumann computing architecture
Understand the basics of the MIPS instruction set architecture and write
simple assembly language programs
Design simple ALU and CPU at an abstract logical level
Understand the principles of Distributed Systems
Understand the basics of important related areas such as Information Theory,
Switching, Databases, Information retrieval, World Wide Web and Data/Web Mining
Required textbook: David A. Patterson and John L. Hennessy, Computer
Organization and Design: The Hardware/Software Interface, Fourth Edition,
Elsevier, 2008, ISBN: 978-0-12-374493-7.
Required software: SPIM simulator, a free software simulator for running
MIPS R2000 assembly language programs, available for Unix, DOS, and Windows.
WEB resources:
Class Participation: Regular attendance and active class participation
are expected from all students. If you must miss a class, try to inform
the instructor of this in advance. In case of missed classes and work due
to plausible reasons (such as illness or accidents) limited assistance
will be offered. Unexcused absences will result in the student being totally
responsible for the make-up process.
Assignments, projects and grading: There will be a mid-term
test, a mid-term project and a term paper. There will
also be a programming assignment and a quiz. The final grade
will be based 40% on the term paper, 30% on the mid-term project, 20% on
the test, 5% on the programming assignment and 5% on the quiz, and will
be affected by classroom participation, conduct and attendance. All grades
will be available in Blackboard Vista. The letter grades will be
calculated according to the following table:
All assignments with their due dates are listed below in the class schedule
(the project descriptions are given on separate pages) and must be submitted
Unexcused late submission policy: Assignments submitted more
than two days after the due date will be graded one letter grade
down. Projects submitted more than a week late will receive
letter grades down. No submissions will be accepted more
than two weeks after the due date.
Honesty policy: The CCSU honor code for Academic Integrity is
in effect in this class. It is expected that all students will conduct
themselves in an honest manner and NEVER claim work which is not their
own. Violating this policy will result in a substantial grade penalty,
and may lead to expulsion from the University. You may find it online at
Please read it carefully.
Tentative schedule of classes, assignments, projects and tests (by week)
Note: Dates will be posted for all classes, project and test
due days. Check the schedule regularly for updates!
August 29, September 5, 10, 12: Computer instruction set architecture
September 17, 19, 24: Computer arithmetic and the ALU design
September 26: Programming assignment due
September 26, October 1: CPU datapath and control
October 3, 8, 10: Pipelining
October 10: Building a 16-bit ALU (part of the Midterm Project) due.
October 15, 17: Memory
October 22, 24: Interfacing peripherals and multiprocessors
October 29: Midterm Project due
November 1-7: Mid-term test. To be taken online through Blackboard Vista
November 5, 7: No classes
November 12: Fundamentals of distributed systems
November 14: Information
November 19: Switching
November 26: Database management concepts
November 28: Distributed memory and data retrieval
December 3: The World Wide Web
December 5: Introduction to Data Mining or term paper presentations
December 5-18: Quiz. To be taken online through Blackboard Vista
December 17, 7-9: Last class (optional)
December 18: Term paper due
CS502 - Week 1
Computer Architecture = Instruction Set Architecture + Machine Organization
Reading: Patterson & Hennessy - Chapter 1, Sections 2.1
- 2.3, 2.5 - 2.7, 2.10, 2.13 (optional), 2.16 - 2.20 (optional), Appendix
B, "Spim, pcspim, and xspim" in Section "Software" on the CD.
Lecture Slides: PDF
Lecture Notes:
Levels of Abstraction
Operating System
Instruction Set Architecture:
Organization of Programmable Storage
Data type and Structures: encodings and machine representation
Instruction set
Instruction Formats
Addressing Modes and Accessing Data and Instructions
Exception Handling
Instruction Set Processing
I/O System
Digital Design
Circuit Design
Basic Components of a Computer
Processor: Datapath and Control
Example: implementing A=B+C (from instructions to gates)
MIPS instructions
Memory organization and load and store
Instruction types and formats
I-type for arithmetic
I-type for lw and sw
I-type for branch
Stored program concept: programs in memory, fetch&execute cycle
Von Neumann Architecture
CPU, Memory System, I/O system
Stored program concept: programs in memory, fetch&execute cycle
Instructions are executed sequentially
Turing machines
Non-Von Neumann Architecture:
The SPIM simulator
Exercises: Load this
program in the SPIM simulator and analyze the format of the instructions.
Run the program with different values of X and Y and trace the execution
in step mode.
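When analyzing instruction formats in SPIM, it can help to split a 32-bit instruction word into its fields by hand. The following is a minimal Python sketch of that decoding, not part of the course materials. The function name `decode_mips` is hypothetical, and it simplifies by treating every opcode other than 0 (R-type) and 2 or 3 (j, jal) as I-type.

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into its fields.

    Simplifying assumption: opcode 0 is R-type, opcodes 2 and 3 are
    J-type (j, jal), and everything else is treated as I-type.
    """
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if opcode == 0:
        return {"format": "R", "opcode": opcode, "rs": rs, "rt": rt,
                "rd": (word >> 11) & 0x1F,
                "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if opcode in (2, 3):
        return {"format": "J", "opcode": opcode, "address": word & 0x3FFFFFF}
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {"format": "I", "opcode": opcode, "rs": rs, "rt": rt, "imm": imm}


add_fields = decode_mips(0x02324020)   # add $t0, $s1, $s2
lw_fields = decode_mips(0x8E680020)    # lw $t0, 32($s3)
```

Here register numbers are the raw 5-bit values (for example 8 for $t0 and 17 for $s1), which is how SPIM displays the encoded instruction.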
CS502 - Week 2
Computer arithmetic and ALU design
Reading: Patterson & Hennessy - Sections 2.4, 3.1
- 3.2, 3.3 - 3.4 (optional), 3.5 (floating-point representation only),
3.6 - 3.10 (optional), C.5 (CD).
Lecture Slides (PDF),
ALU diagram (PDF)
Lecture Notes:
Representing numbers: sign bit, one's complement, two's complement.
Arithmetic: addition, subtraction, detecting overflow.
Building ALU - hierarchical approach
Floating point numbers
Scientific notation: (-1)^sign * significand * 2^exponent
Range and precision (overflow and underflow).
IEEE 754 floating point standard - allows integer comparison:
normalized representation
implicit leading 1
exponent is biased: exponent in [0..0 (most negative), 1..1 (most positive)]
bits of the significand represent the fraction between 0 and 1.
(-1)^S * (1 + s1*2^-1 + s2*2^-2 + ...) * 2^(exponent-bias)
Problems with floating point arithmetic
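The single-precision layout described above (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits with an implicit leading 1) can be checked with a short Python sketch. The helper names are hypothetical, and the value formula below covers normalized numbers only (it ignores zero, denormals, infinities and NaN).

```python
import struct


def decode_float32(x):
    """Return (sign, biased exponent, fraction bits) of x as IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def float32_value(sign, exponent, fraction):
    """(-1)^S * (1 + fraction) * 2^(exponent - bias), normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)


fields = decode_float32(-0.75)        # -0.75 = -1.1 (binary) * 2^-1
rebuilt = float32_value(*fields)
rounding_problem = (0.1 + 0.2) != 0.3  # one of the problems of floating point arithmetic
```

The last line illustrates a typical rounding problem: 0.1, 0.2 and 0.3 have no exact binary representation, so the comparison for equality fails.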
Tutorials and practice quizzes:
Two’s complement numbers:
Floating point:
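As a practice aid for the two's complement material, here is a small Python sketch of conversion and of overflow detection in addition. The function names and the 8-bit default width are assumptions chosen for illustration. Overflow is detected with the usual rule: both operands have the same sign and the result has the opposite sign.

```python
def to_twos_complement(value, bits=8):
    """Return the bit pattern (as a non-negative int) of a signed value."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given number of bits")
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern, bits=8):
    """Interpret a bit pattern as a signed two's complement number."""
    if pattern & (1 << (bits - 1)):      # sign bit set
        return pattern - (1 << bits)
    return pattern


def add_with_overflow(a, b, bits=8):
    """Add two signed numbers in fixed width; return (result, overflow flag)."""
    mask = (1 << bits) - 1
    raw = (to_twos_complement(a, bits) + to_twos_complement(b, bits)) & mask
    result = from_twos_complement(raw, bits)
    overflow = (a >= 0) == (b >= 0) and (result >= 0) != (a >= 0)
    return result, overflow
```

For example, with 8 bits, 100 + 27 fits, while 100 + 28 wraps around to a negative number and sets the overflow flag.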
CS502 - Week 3
CPU datapath and control
Reading: Patterson & Hennessy - Sections 4.1 - 4.4
Lecture Slides: PDF
Lecture Notes:
I. Building a Datapath
Abstract level implementation:
Instruction memory
Program counter
Register file
Data memory
Basic building elements
Combinational logic (e.g. ALU)
State elements (registers, memory)
Clocking methodology: edge triggered
Fetching instructions and incrementing the program counter
Register file and execution of R-type instructions
Datapath for lw and sw instructions (add data memory and sign extend)
Datapath for branch instructions
II. Control
ALU control: mapping the opcode and function bits to the ALU control inputs
Designing the main control unit
Operation of the Datapath (single-cycle implementation):
R-type instructions
Load (store) word
Branching instructions
Problems of the single-cycle implementation
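The ALU control step (mapping the main control's ALUOp signal and the instruction's function bits to the ALU control inputs) can be written as a small lookup. The sketch below uses the encodings of the textbook's simplified single-cycle design as I recall them; the constant and function names are hypothetical, so check the values against Section 4.4 before relying on them.

```python
# 4-bit ALU control inputs (assumed to match the textbook's simplified design)
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# Function field of R-type instructions -> ALU operation
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,   # add
    0b100010: ALU_SUB,   # sub
    0b100100: ALU_AND,   # and
    0b100101: ALU_OR,    # or
    0b101010: ALU_SLT,   # slt
}


def alu_control(alu_op, funct=0):
    """Map the 2-bit ALUOp and the 6-bit funct field to the ALU control input."""
    if alu_op == 0b00:       # lw / sw: compute the address by addition
        return ALU_ADD
    if alu_op == 0b01:       # beq: compare by subtraction
        return ALU_SUB
    if alu_op == 0b10:       # R-type: operation is given by the funct field
        return FUNCT_TO_ALU[funct]
    raise ValueError("unsupported ALUOp")
```

The two-level decoding keeps the main control unit small: it only has to produce the 2-bit ALUOp, and the funct field is examined only for R-type instructions.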
CS502 - Week 4
Reading: Patterson & Hennessy - Sections 4.5 - 4.8 (without
implementation details).
Lecture Slides: PDF
Lecture notes:
I. Introduction to Pipelining
Pipelining by analogy (laundry example):
Pipelining helps throughput of the entire workload
Multiple tasks operating simultaneously and using different resources
Potential speedup = number of pipe stages
The pipeline rate is limited by the slowest stage
Unbalanced lengths of pipe stages reduce the speedup
The time to "fill" the pipeline and the time to "drain" it reduce the speedup
Stall for dependencies
Five stages of the load MIPS instruction
Instruction fetch.
Instruction decode.
Execute (ALU operation).
Memory access.
Write back (register write).
The pipelined datapath
Single cycle, multiple cycle vs. pipeline
Advantages of pipelined execution
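The speedup claims above (limited by the slowest stage, reduced by unbalanced stages and by fill and drain time) can be checked with a few lines of arithmetic. This is a rough Python sketch assuming an ideal pipeline without stalls; the stage times in picoseconds are illustrative values, not measurements.

```python
def single_cycle_time(n_instructions, stage_times):
    """Each instruction takes the sum of all stage times."""
    return n_instructions * sum(stage_times)


def pipelined_time(n_instructions, stage_times):
    """Ideal pipeline: the clock is set by the slowest stage, and the
    first instruction needs len(stage_times) cycles to fill the pipe."""
    cycle = max(stage_times)
    return (len(stage_times) + n_instructions - 1) * cycle


# Illustrative stage times (ps): fetch, decode, execute, memory, write back
stages = [200, 100, 200, 200, 100]

short_run = (single_cycle_time(3, stages), pipelined_time(3, stages))
long_speedup = single_cycle_time(1_000_000, stages) / pipelined_time(1_000_000, stages)
```

With three instructions the fill time dominates (2400 ps versus 1400 ps). With a million instructions the speedup approaches 4 rather than 5, because the stages are unbalanced.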
II. Problems with pipelining (pipeline hazards)
Structural hazards: single memory
Control hazards:
Stall: wait until decision is clear
Predict: fixed prediction (e.g. fail), dynamic prediction (based on history)
Delayed branch (software solution):
add $4, $5, $6            beq $1, $2, 40
beq $1, $2, 40     ==>    add $4, $5, $6
lw $3, 300($0)            lw $3, 300($0)
Data hazards (dependencies backwards in time):
Forwarding (bypassing - hardware solution)
Reordering code (software solution)
lw $t0, 0($t1)            lw $t0, 0($t1)
lw $t2, 4($t1)     ==>    lw $t2, 4($t1)
sw $t2, 0($t1)            sw $t0, 4($t1)
sw $t0, 4($t1)            sw $t2, 0($t1)
CS502 - Week 5
Memory hierarchies
Reading: Patterson & Hennessy - Sections 5.1 - 5.5.
Lecture Slides: PDF
Lecture notes:
Memory technologies and trends
Impact on performance
The need for hierarchical memory organization
The principle of locality
Memory hierarchy terminology
The Basics of caches
Direct-mapped cache: Accessing a cache (Cache index, Tag, Valid bit)
Writing to the cache (write-through and write-back schemes)
Improving cache performance by flexible placement of blocks in the cache
Direct mapped: Cache index = (Block address) modulo (Number of blocks in
cache); no search; small tag.
Set associative: Cache index = (Block address) modulo (Number of sets in
cache); search the set; larger tag.
Fully associative: Cache index is not determined; search the whole cache;
tag = address.
Choosing which block to replace: least recently used
The need for virtual memory
Many programs (processes) can use a single memory
Use a memory exceeding the size of the main memory
VM organization and terminology: virtual address, physical address, page,
page offset, page fault, memory mapping (translation).
Addressing pages:
Page table, page table register
Processes (active, inactive) and page tables
Page faults
Replacing pages: LRU, reference (use) bit
Write-back scheme (dirty bit)
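The virtual memory terminology above (virtual address, page, page offset, page table, page fault) can be tied together with a minimal translation sketch in Python. The 4 KiB page size, the dictionary used as a page table and the names `translate` and `PageFault` are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when the virtual page is not present in main memory."""


def translate(virtual_address, page_table):
    """Translate a virtual address to a physical address.

    page_table maps virtual page numbers to physical frame numbers;
    pages missing from it are treated as not resident.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page_number)
    if frame is None:
        raise PageFault(page_number)
    return frame * PAGE_SIZE + offset


table = {0: 5, 1: 2}              # hypothetical mapping: page -> frame
physical = translate(4100, table)  # page 1, offset 4
```

Only the page number is translated; the page offset is copied unchanged into the physical address.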
Write a sequence of memory references for which:
the direct mapped cache performs better than the 2-way associative cache;
the 2-way associative cache performs better than the fully associative cache.
Exercises 5.3, 5.4 (pages 550, 551).
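To check answers to exercises like the ones above, a small cache simulator is enough. The Python sketch below counts misses for a sequence of block addresses under the three placement schemes, using set index = (block address) modulo (number of sets) and least recently used replacement. The function name and parameters are hypothetical; `ways=1` gives a direct-mapped cache and `ways=num_blocks` a fully associative one.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, ways):
    """Count cache misses for a sequence of block addresses.

    num_blocks is the cache size in blocks; ways is the associativity.
    Replacement within a set is least recently used (LRU).
    """
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)         # mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)    # evict the LRU block
            cache_set[block] = True
    return misses


references = [0, 8, 0, 6, 8]
results = {
    "direct mapped": count_misses(references, num_blocks=4, ways=1),
    "2-way": count_misses(references, num_blocks=4, ways=2),
    "fully associative": count_misses(references, num_blocks=4, ways=4),
}
```

For this particular sequence and a four-block cache, higher associativity gives fewer misses (5, 4 and 3). The exercises ask for sequences where the opposite happens, which the same function can verify.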
CS502 - Week 6
Interfacing peripherals and multiprocessors
Reading: Patterson & Hennessy - Sections 6.1 - 6.8, 7.1
- 7.4
Lecture Slides: Chapter7.pdf
Lecture notes:
Interfacing Processors and Peripherals - Buses (slides
in PDF)
Buses: lines, transactions, types
Synchronous and asynchronous buses
Handshaking protocol
Bus access: master and slave
Bus arbitration schemes
Bus standards
Interfacing I/O devices to Memory, CPU and OS (slides
in PDF)
The role of the operating system in interfacing I/O devices to Memory
Controlling the I/O devices
Memory mapped I/O
Special I/O instructions
Communicating with the processor
Interrupt-driven I/O
Direct memory access (DMA)
Multiprocessors (slides
in PDF)
Basic approaches to sharing data and types of connectivity
Programming multiprocessors
Multiprocessors connected by a single bus
A parallel program
Multiprocessor cache coherency
Implementing a multiprocessor cache coherency protocol
Synchronization using coherency
Networks of multiprocessors (slides
in PDF)
Shared memory vs. multiple private memories
Centralized memory vs. distributed memory
Parallel programming by message passing
Distributed memory communication
Memory allocation
Clusters and network topology
Modern clusters
CS502 - Week 7
Fundamentals of distributed systems
Additional Reading: Andrew S. Tanenbaum, Maarten van Steen,
Distributed Systems: Principles and Paradigms.
Lecture Notes (in PDF):
Categories of computer systems
Conventional sequential machines (mainframes, minicomputers + network
of terminals) - multitasking, multiusers.
Conventional systems with special purpose components (specialized processors)
- single specialized task.
Multiprocessor systems - single task allowing parallel computation.
Distributed systems (computers connected by a network) - different task,
shared data.
Conventional systems with special purpose components
A special purpose unit (e.g. math processor) attached to the main bus
Back-end system (additional separate machine, e.g. graphic terminal)
Example: iDBP
file operations: positioning and manipulating a cursor in a file
used to implement relational database systems
add-on board or back-end system
Multiprocessor systems:
Multiple processors
Shared memory (single address space) vs. multiple private memories
Centralized memory vs. distributed memory
Categories of parallelism:
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD)
Multiple instruction streams, single data stream (MISD)
Multiple instruction streams, multiple data streams (MIMD)
Distributed computer systems
data location and security
load distribution
process migration
fault tolerance
homogeneous systems
heterogeneous systems
Distributed file systems:
Fault-tolerant networks:
redundancy (static-masking, dynamic)
consistency (strong, weak)
Firewalls and self-stabilizing networks
Examples of distributed systems
CS502 - Week 8
Information Theory
Fundamentals of information theory (information vs. data)
Probability (see http://en.wikipedia.org/wiki/Probability
or http://mathworld.wolfram.com/Probability.html)
Frequency approach - experiments
Objective approach - properties of real-world entities
Subjective approach - the uncertainty of the agent (agent's knowledge)
Example: the probability that the sun will rise tomorrow - undefined, 1,
1-e, Laplace estimate: (d+1)/(d+2); Laplace for k outcomes: (n+1)/(N+k)
Joint probability: P(A.B)=P(A)*P(B) (if A and B are independent)
Conditional probability: P(A.B)=P(A|B)*P(B)
Bayes Theorem (more in PDF): P(A|B)=P(B|A)*P(A)/P(B)
Interpretation of Bayes: A - cause; B - effect; posterior=likelihood*prior/evidence
Law of total probability: A={A1,A2,...An},
exhaustive, mutually exclusive (P(Ai.Aj)=0; i\=j); P(B)=SUM(i=1..n) P(B|Ai)*P(Ai)
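Bayes' theorem and the law of total probability combine naturally: the evidence P(B) is computed by summing over the exhaustive, mutually exclusive causes, and each posterior is likelihood times prior divided by that evidence. A minimal Python sketch follows; the function names and the numbers (a rare cause with prior 0.01) are made up for illustration.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum over i of P(B|Ai) * P(Ai)."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def bayes_posterior(priors, likelihoods, i):
    """P(Ai|B) = P(B|Ai) * P(Ai) / P(B)."""
    return likelihoods[i] * priors[i] / total_probability(priors, likelihoods)


priors = [0.01, 0.99]        # hypothetical P(A), P(~A)
likelihoods = [0.9, 0.05]    # hypothetical P(B|A), P(B|~A)

evidence = total_probability(priors, likelihoods)
posterior = bayes_posterior(priors, likelihoods, 0)
```

Even with a likelihood of 0.9, the posterior of the rare cause stays fairly low (about 0.15 with these numbers), because the prior is small.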
Claude Shannon (see http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html)
Measuring information
Settings: source, message, recipient
Recipient's point of view: a measure of uncertainty, expected surprise
of getting the message
Basic formula: I(M)=-P(M) or I(M)=-log2P(M)
(minimal number of bits to encode the message)
Logarithmic scale: additivity: I(M1)+I(M2)=-logP(M1)-logP(M2)=-log(P(M1)P(M2))=I(M1
and M2)
Log 2 scale: reducing the uncertainty (halving strategy, guess a number
in [0,7]), measuring unit: bit
Entropy (average information): uncertainty of the recipient, the expected
surprise of getting an arbitrary message from a set of possible messages.
Example1: H({A,~A})=-P(A).logP(A)-P(~A).logP(~A). Maximum
for P(A)=1/2, minimum for P(A)=1
Example2: X={0,1,...,7}, P(X=0)=P(X=1)=...=P(X=7)=1/8;
H(X)=-8*(1/8)*log(1/8)=3 (bits)
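The two entropy examples can be reproduced with a few lines of Python. This is only a sketch with hypothetical function names; terms with probability 0 are skipped, following the usual convention that 0*log(0) counts as 0.

```python
import math


def information(p):
    """Information (in bits) of a message with probability p."""
    return -math.log2(p)


def entropy(probabilities):
    """Average information (in bits) of a set of possible messages."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


uniform_eight = entropy([1 / 8] * 8)   # Example2: eight equally likely values
fair_coin = entropy([0.5, 0.5])        # Example1 at its maximum, P(A)=1/2
certain = entropy([1.0, 0.0])          # Example1 at its minimum, P(A)=1
```

The uniform case gives 3 bits, which matches the halving strategy for guessing a number in [0,7]: three yes/no questions are enough.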
Representing signals: frequency, time, amplitude
Baseband signals: F(t) limited in frequency
in the interval [-W,W]
Broadband signals (carrier based): moved in the interval [Wc-W,Wc+W],
e.g. F(t)=Fs(t)sin(Wct)
Representing signal by sample points (theoretically, infinite number of
points required)
locating a point in space: coordinate systems
locating a function F(t): vector of sample values F(tn),
tn - sample points
orthonormal sets of functions: Integral(a,b) Gn(x)Gm(x)dx
= {1 for n=m; 0 for n\=m}
expansion of a function (Fourier series): F(x)=Sum(n=-inf, n=+inf) CnGn(x),
Cm=Integral(-inf,+inf) F(x)Gm(x)dx
Sampling theorem (see also http://mathworld.wolfram.com/SamplingTheorem.html)
in [-T,T], T-> +inf; Sincn(t)=sin(2piW(t-n/(2W)))/(2piW(t-n/(2W)))
F(t)=Sum(n=-inf, n=+inf) Cn Sincn(t)
Cn=F(tn): at t=tn all terms above are 0 except one, for which sin(0)/0=1
F(t)=Sum(n=-inf, n=+inf) F(tn)sin(2piW(t-tn))/(2piW(t-tn)),
tn=n/(2W) (sampling points)
constraints: F(t) is limited in frequency
in [-W,W], infinite time duration (infinite
number of sample points)
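The sampling theorem expansion can be tried out numerically. The Python sketch below rebuilds F(t) from samples taken at tn=n/(2W) using the sin(x)/x kernel. The function names, the bandwidth W=4 and the test signal are assumptions for illustration, and the sum is truncated to a finite number of samples, so between sample points the result is only an approximation.

```python
import math


def sinc_kernel(t, n, W):
    """sin(2*pi*W*(t - tn)) / (2*pi*W*(t - tn)) with tn = n/(2W)."""
    x = 2 * math.pi * W * (t - n / (2 * W))
    return 1.0 if x == 0 else math.sin(x) / x


def reconstruct(samples, W, t):
    """Truncated sampling-theorem sum; samples maps n to F(tn)."""
    return sum(value * sinc_kernel(t, n, W) for n, value in samples.items())


W = 4.0                                     # assumed bandwidth (Hz)
samples = {n: math.cos(2 * math.pi * 1.0 * n / (2 * W))   # 1 Hz test signal
           for n in range(-50, 51)}
at_sample_point = reconstruct(samples, W, 3 / (2 * W))
```

At a sample point every term but one vanishes, so the sum returns the stored sample value, which is exactly the argument used in the notes to identify the coefficients.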
AD and DA conversion: quantization (thresholds), amplifiers vs. repeaters,
generally very low error rates (regenerating digital values at relay points)
Error detection and correction
Binary case: {000,111} for {0,1} - one error correcting code
Arbitrary waveform: three instead of one sample point - widening the bandwidth
(2TW dimensions)
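The binary code {000,111} for {0,1} corrects one error per block by majority vote. A minimal Python sketch (hypothetical function names) shows both the correction of a single flipped bit and the failure with two flipped bits in the same block.

```python
def encode(bits):
    """Repeat every bit three times: 0 -> 000, 1 -> 111."""
    return [b for b in bits for _ in range(3)]


def decode(received):
    """Majority vote over each block of three received bits."""
    return [1 if sum(received[i:i + 3]) >= 2 else 0
            for i in range(0, len(received), 3)]


sent = encode([1, 0])            # [1, 1, 1, 0, 0, 0]
one_error = sent.copy()
one_error[1] ^= 1                # flip one bit in the first block
two_errors = sent.copy()
two_errors[0] ^= 1
two_errors[1] ^= 1               # flip two bits in the first block
```

The price of the correction is bandwidth: three symbols are transmitted for every information bit.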
Applets from Information Technology: Inside and Outside by David Cyganski,
John A. Orr with Richard F. Vaz.
CS502 - Week 9
The importance of switching in communication - the cost of switching is
Definition: transfer input sample points to the correct output ports
at the correct time
Digital switching (sample points amplitudes are 0's and 1's)
Circuit switching
Packet switching
Voice digitization: W=3 kHz, sampling at 2*3=6 or 8 kHz, 256 levels for quantization
(8 bits), Bit rate=64 kb/s
Telephone switching
Time division multiplexing: time slot (0.1 µs), field, frame; 125 µs/0.8 µs = about 150
channels + time for synchronization and control
Switch architecture
Sampling input signals, storing values in memory, placing values in the
proper field and frame of the output sequence
Need for more channels: hierarchical switching
Combining time and space switching
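The voice digitization and time division multiplexing figures can be reproduced with integer arithmetic. The Python sketch below assumes that the slot and frame times in the notes are in microseconds: a 0.1 µs slot per bit, an 8-bit field of 0.8 µs, and a 125 µs frame (one sample period at 8 kHz). The function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, levels):
    """Bit rate of a digitized channel: samples per second times bits per sample."""
    bits_per_sample = (levels - 1).bit_length()   # 256 levels -> 8 bits
    return sample_rate_hz * bits_per_sample


def fields_per_frame(sample_rate_hz, bit_slot_ns, bits_per_sample):
    """How many fields (one sample each) fit into one frame.

    The frame lasts one sampling period; times are in nanoseconds
    to keep the arithmetic exact.
    """
    frame_ns = 1_000_000_000 // sample_rate_hz
    field_ns = bit_slot_ns * bits_per_sample
    return frame_ns // field_ns


voice_rate = pcm_bit_rate(8000, 256)             # bits per second
fields = fields_per_frame(8000, 100, 8)          # 0.1 µs slot = 100 ns
```

Under these assumptions a frame holds 156 fields, which leaves a few fields for synchronization and control after about 150 voice channels.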
General framework for switching
time, space and frequency (broadband signals) switching
synchronization (single clock) and buffering (memory)
set-up time and delay (propagation time)
"call duration" assignment vs. dynamic assignment
in-band and out-of-band signaling
Circuit (synchronous) vs. packet (asynchronous) switching: control and
routing overhead, virtual packet switching
Switching techniques and networking: switching is the technology allowing
to get a message between the nodes of a network
Crossbar switching: mechanical (in the past) or electronic.
Bus and cable switches: computer buses or cables (switching + transportation
= network)
Token passing approach (similar to the locks used by multiprocessors connected
by a bus)
Ethernet approach: cable or ring, packets, conflicts, resending
Synchronization and Hub switch: star networks (no conflicts)
slides in PPT
CS502 - Week 10
Database management concepts
Database Management Systems (DBMS)
Managing large quantities of structured data
Efficient retrieval and modification: query processing and optimization
Sharing data: multiple users use and manipulate data
Controlling the access to data: maintaining the data integrity
An example of a database (relational): relations (tables), attributes (columns),
tuples (rows). Example query: Salesperson='Mary' AND Price>100.
Database schema (e.g. relational): names and types of attributes, addresses,
indexing, statistics, authorization rules to access data etc.
Data independence: separation of the physical and logical data (particularly
important for distributed systems). The mapping between them is provided
by the schema.
Architecture of a DBMS - three levels: external, conceptual and internal
Types of DBMS
The data structures supported: tables (relational), trees, networks, objects
Type of service provided: high level query language, programming primitives
Basic DBMS types
Linear files
Sequence of records with a fixed format usually stored on a single file
Limitation: single file
Example query: Salesperson='Mary' AND Price>100
Hierarchical structure
Trees of records: one-to-many relationships
Limitations. ITEM-SUPPLIER: many-to-many relationship (requires duplicating
records); problems when updated
Retrieval requires knowing the structure (limited data independence): traversing
the tree from top to bottom using a procedural language
Network structure
Similar to the hierarchical database with the implementation of many-to-many
Record and set types
Relational structure
Relations, attributes, tuples
Primary key (unique combination of attributes for each tuple)
Foreign keys: relationships between tuples (many-to-many). Example: SUPPLIES
defines relations between ITEM and SUPPLIER tuples.
Advantages: many-to-many relationships, high level declarative query
language (e.g. SQL)
SQL example (retrieve all items supplied by a supplier located in Troy):
SUPPLIER.Supplier# = SUPPLIES.Supplier#
Programming language interfaces: including SQL queries in the code
Object-Oriented structure
Objects (collection of data items and procedures) and interactions between
Is this really a new paradigm, or a special case of network structure?
Separate implementation vs. implementation on top of a RDBMS
Retrieving and manipulating data: query processing
Parsing and validating a query: data dictionary - a relation listing
all relations and relations listing the attributes
Plans for computing the query: list of possible ways to execute the
query, estimated cost for each. Example:
SELECT ItemName, Price
FROM ITEM, SALES
WHERE SALES.Item# = ITEM.Item# AND Salesperson="Mary"
Index: B-tree index, drawbacks - additional space, updating; indexing
not all relations (e.g. the keys only)
Estimating the cost for computing a query: size of the relation, existence/size
of the indices. Example: estimating Attribute=value with a given
number of tuples and the size of the index.
Query optimization: finding the best plan (minimizing the computational
cost and the size of the intermediate results), subsets of tuples, projection
and join.
Static and dynamic optimization
Database views: creating user defined subsets of the database, improving
the user interface. Example:
CREATE VIEW MarySales(ItemName,Price)
AS SELECT ItemName, Price
FROM ITEM, SALES
WHERE ITEM.Item#=SALES.Item# AND Salesperson="Mary"
Then the query:
FROM MarySales
WHERE Price>100
translates to:
WHERE ITEM.Item#=SALES.Item# AND Salesperson="Mary" AND Price>100
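The view mechanism can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: the table layouts, the column name ItemNo (used instead of Item#, which would need quoting) and the sample rows are all assumptions, since the notes do not give the full schema. It checks that a query on the view returns the same rows as the expanded query on the base relations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ITEM (ItemNo INTEGER PRIMARY KEY, ItemName TEXT, Price INTEGER);
CREATE TABLE SALES (ItemNo INTEGER REFERENCES ITEM(ItemNo), Salesperson TEXT);
-- hypothetical sample data
INSERT INTO ITEM VALUES (1, 'Lamp', 40), (2, 'Desk', 250), (3, 'Chair', 120);
INSERT INTO SALES VALUES (1, 'Mary'), (2, 'Mary'), (3, 'John');
CREATE VIEW MarySales AS
    SELECT ItemName, Price
    FROM ITEM, SALES
    WHERE ITEM.ItemNo = SALES.ItemNo AND Salesperson = 'Mary';
""")

# Query written against the view
via_view = conn.execute(
    "SELECT ItemName, Price FROM MarySales WHERE Price > 100"
).fetchall()

# The same query after the view definition has been substituted
expanded = conn.execute(
    "SELECT ItemName, Price FROM ITEM, SALES "
    "WHERE ITEM.ItemNo = SALES.ItemNo AND Salesperson = 'Mary' AND Price > 100"
).fetchall()
conn.close()
```

The view stores no data of its own; the DBMS rewrites the query by and-ing the user's condition to the WHERE clause of the view definition.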
Data integrity
Integrity constraints: semantic conditions on the data
Individual constraints on data items
Uniqueness of the primary keys
Dependencies between relations
Concurrency control
Steps in executing a query.
Concurrent users of the database: the execution of one query may interfere
with the execution of another
Transaction: a set of operations that takes the database from one
consistent state to another
Solving the concurrency control problem: making transactions atomic
operations (one at a time)
Concurrent transactions: serializability theory (two-phase locking), read
lock (many), write lock (one).
Serializable transactions: first phase - accumulating locks, second phase
- releasing locks.
Deadlocks: deadlock detection algorithms.
Distributed execution problems: release a lock at one node (all locks accumulated
at the other node?); strict two-phase locking
Backup and recovery
The problem of keeping a transaction atomic: successful or failed (What
if some of the intermediate steps failed)
Log of database activity: use the log to undo a failed transaction.
More problems: when to write the log, failure of the recovery system executing
the log.
Security and access control
Access rules for relations or attributes. Stored in a special relation
(part of the data dictionary).
Content-independent and content-dependent access control
Content-dependent control: access to a view only or query modification (e.g.
and-ing a predicate to the WHERE clause)
Discretionary and mandatory access control
Client-Server architectures
Knowledge Bases and KBS (an area of AI)
Information, Data, Knowledge (data in a form that allows reasoning)
Basic components of a KBS
Knowledge base
Inference (reasoning) mechanism (e.g. forward/backward chaining)
Explanation mechanism/Interface
Rule-based systems (medical diagnostics, credit evaluation etc.)
slides in PPT
CS502 - Week 11
Distributed memory and data/information retrieval
The need for a memory hierarchy
Massive data storage requirements (e.g. a satellite orbiting Earth generates
1 terabyte data per day)
Difference in the cost and capacity of different types of memory
Data location
Data location factors
Application software
Directory - a mechanism to locate a data object
Directory location: may be different from the data location
Directory server: a separate system, not a part of the application program
Directory structure: usually hierarchical
Directory search: finding out the data location and making use of this
fact (e.g. designing an application)
Providing access control
Directory systems
Pure approaches
Master control directory: a single directory containing all the information
Directory server: a single computer providing all directory functions (contains
the master directory)
Fully replicated directories: copies of the directory at different locations
(concurrency control required)
Local directories: contain information on local data and usually stored
at the same location
Point of control directories: located at the point where the access control
is exercised (e.g. hard disk)
Problems with scaling: hierarchical system of directories (different
directories at different levels)
Important issues
Directory backup
Coherency and consistency
Local caching
Directories and access control (different functions). Points of access
Application program (problems with using data provided by other applications)
DBMS (the most common approach)
Control at the physical location of data (e.g. library)
Function of the communication networks (e.g. preventing a user from accessing
another node, telephone system)
Information retrieval (Slides
in PPT)
Finding relevant data using irrelevant keys
Example: database of photographic images sorted by number, date.
DBMS: Well structured data according to the information content
Text document retrieval (not well structured data)
Document retrieval queries
NLP: requires structured data to match the translated query
Abstracting documents
Keyword search: boolean, weighted
Evaluating retrieval quality
Precision: retrieved and relevant /retrieved
Recall: retrieved and relevant /relevant
Vector space model
Document and query vectors
Relevance ranking: cosine similarity
Access methods
Full text scanning: hardware support
Document signature: a fixed length bit string for each document (precision
and recall easily computed)
Inverted index: keyword or phrase indexing
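The retrieval quality measures and the cosine similarity used for relevance ranking are easy to compute directly. The Python sketch below uses hypothetical function names and made-up document identifiers and vectors.

```python
import math


def precision_recall(retrieved, relevant):
    """Precision = |retrieved and relevant| / |retrieved|,
    recall = |retrieved and relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def cosine_similarity(u, v):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical example: 4 documents retrieved, 3 documents are relevant
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 6})
same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short query with the same keyword proportions score close to 1.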
Information Retrieval Resources and Demos: http://ir.exp.sis.pitt.edu/resources/
Web document retrieval (Slides
in PDF)
Crawling the Web
Analyzing the Web structure (Slides
in PDF)
Social Networks
Page Rank
Authorities and Hubs
Web document retrieval books and articles
Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage,
Chapters 1, 2
Mining the Web: Discovering Knowledge from Hypertext Data, Chapters 7, 8.
The PageRank Citation Ranking: Bringing Order to the Web by Lawrence Page,
Sergey Brin, Rajeev Motwani, and Terry Winograd. Stanford Digital Library
Technology Project.
Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg.
Journal of the ACM 46:5, pp. 604-632, 1999
Document classification and clustering:
Document attributes: semantic (NLP, domain knowledge), statistical (keyword
frequencies, vector space model, TFIDF), mixed
Similarity measures: cosine similarity, distance, information compression
The Machine Learning approach
Bayesian text classification (reading: Tom
Mitchell, Machine Learning, McGraw Hill, 1997)
Clustering algorithms (reading: A
Comparison of Document Clustering Techniques)
Flat clustering: cluster and centroid (typical cluster document), k-means
Hierarchical clustering (example Dendrogram)
Web Mining
Web content mining - discovery of Web document content patterns (text mining).
Web structure mining - discovery of hypertext/linking structure patterns
use hyperlinks to enhance text classification
page ranking
modeling and measuring the Web
Web usage mining - discovery of web users activity patterns
mining web server logs
mining client machine access logs
Web Mining books
CS502 - Week 12
The World Wide Web
A little history
1989, CERN: distribution of linked documents (nuclear physics)
1991, text-based prototype
1993, First graphical interface - Mosaic
1994: WWW consortium (CERN, MIT):
The client side
WEB documents (pages) connected via hyperlinks (hypertext).
Hyperlinks: highlighted strings in text or images.
Browsers (text-based or graphical): tools for navigating the WEB.
Forms and Java applets
The server side
URL (Uniform Resource Locator): <protocol name>://<machine name>/<file
name>, e.g. http://www.w3.org/History.html
Steps of fetching http://www.w3.org/History.html
The browser determines the URL
The browser asks the local DNS (Domain Name System) server for the IP (Internet
Protocol) address
DNS replies with
The browser makes a TCP (Transmission Control Protocol) connection to port
80 on
It then sends a GET /hypertext/WWW/History.html
The www.w3.org server sends the file History.html
The TCP connection is released
The browser displays all the text in History.html
The browser displays all the images in History.html (new TCP connection
for each image)
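The first steps above (taking the URL apart and composing the request) can be sketched in Python without touching the network. The function name `build_get_request` is hypothetical; the DNS lookup and the TCP connection are left out, and port 80 is used when the URL does not name one.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Split a URL into host, port and path, and compose an HTTP/1.0 GET request."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # default HTTP port
    path = parts.path or "/"
    request = f"GET {path} HTTP/1.0\r\n\r\n"
    return host, port, request


host, port, request = build_get_request("http://www.w3.org/History.html")
```

The host name is what the browser hands to DNS, the port is where the TCP connection is made, and the request string is what is sent once the connection is open.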
Example of HTTP protocol in text:
C: telnet www.w3.org 80
T: Trying ...
T: Connected to www.w3.org
T: Escape character is '^]'
C: GET /hypertext/WWW/TheProject.html HTTP/1.0
HTTP/1.0 200 Document follows
MIME-Version: 1.0
Server: CERN/3.0
Content-Type: text/html
Content-Length: 8247
<HEAD><TITLE> The World Wide Web Consortium
<H1><IMG ALIGN=MIDDLE ALT="W3C" SRC="Icons/WWW/w3c_main.gif">
The World Wide Web Consortium </H><P>
The World Wide Web is the universe of network-accessible
Using other protocols
HTTP Browser <---> FTP server
HTTP Browser <---> FTP Proxy server <---> FTP server
Proxy servers: translating protocols, caching pages, filtering information
HyperText Transfer Protocol (HTTP)
Simple (GET without the protocol version) and full requests
Methods (commands)
GET: request to read a Web page encoded in MIME (Multipurpose Internet
Mail Extensions - adding a header to describe the encoding)
HEAD: request to read a Web page header
PUT: request to store a Web page (may include authentication headers)
POST: request to append new data to a Web page (e.g. posting a message
to a news group)
DELETE: request to delete a Web page (may include authentication headers)
LINK: Connects two existing pages
UNLINK: breaks an existing connection between two pages
Writing a Web page in HTML
URL (Uniform Resource Locator): a mechanism for naming and locating Web pages
<Protocol>://<DNS name of the host>/<File name (with possible
shortcuts, e.g. ~user)>
The HTML language
Standardized markup language: how the documents are to be formatted and
reformatted (e.g. LaTeX), contrasted to WYSIWYG (not standardized, does
not allow reformatting)
Tags with parameters, e.g. <IMG SRC ="http://www.widget.com/image.gif"
ALT="AWI Logo">
<A HREF="http://www.nasa.gov"> NASA's home page </A>
<A HREF="http://www.nasa.gov"> <IMG SRC="shuttle.gif" ALT="NASA">
Forms (HTML 2.0): two-way traffic between the page owner and page user,
<INPUT> tag. Example: Look-Up
CGI (Common Gateway Interface): a standard for handling forms' data. Example:
interfacing a database:
Write a script (program) to interface between a database and the Web
Store the script in the cgi-bin directory under a URL.
When a browser retrieves a page located in cgi-bin, the HTTP server executes
the script.
Put the script name in the ACTION parameter of the form.
The browser invokes the operation specified in METHOD (usually POST)
The script is started and presented with the form input data.
After the database retrieval the script produces an HTML file, which is
sent back to the browser.
This mechanism can be used to include selected ads in the Web page depending
on what the user is looking for.
Implementing highly interactive Web pages
Running applications and using remote servers or databases
Adding animation and sound to the Web page without spawning an external
No need of standard for the Web protocols (HTTP, FTP etc.)
The Java system for the Web
A Java-to-bytecode compiler
A browser that understands applets (<APPLET> tag)
A bytecode interpreter
Security issues
Problem: running a program on the client's machine (possible crash, collecting
private information, consuming resources)
First barrier: strongly typed language - array bounds checking, no pointers
(thus no direct memory access)
Second barrier: bytecode verifier
Third barrier: class loader (loading system classes first, before looking
for user classes)
Fourth barrier: security measures built into the classes, e.g. file access
class (asking user if the applet needs file access).
More problems:
asking to access the /tmp directory
requiring network access to work
covert channels
Locating information on the Web
Web Challenges (How to turn the web data into web knowledge)
Web Search Engines
Topic Directories
Semantic Web
Crawling the Web
Information Retrieval and Web Search
Web Agents
Is There an Intelligent Agent in Your Future?
A good internet agent needs to be:
Communicative: able to understand your goals, preferences and constraints.
Capable: able to take actions rather than simply provide advice.
Autonomous: able to act without the user being in control the whole time.
Adaptive: able to learn from experience about both its tasks and
its user's preferences.
Agent research sites:
Semantic Web
MIPS Programming assignment
(5 points)
Due date: September 26
Write a program in MIPS assembler to perform some simple computation (e.g.
average of three integers, converting miles into kilometers, Fahrenheit
into Celsius etc.). The program must include:
At least one instruction from each instruction type: R-type arithmetic,
I-type arithmetic and Memory transfer.
Input and output through system calls.
Comments explaining the format and the meaning of each instruction.
Use the SPIM simulator
to debug and run the program and read Patterson & Hennessy, B.9
and "Spim, pcspim, and xspim" in Section "Software" on the CD. You may
also find additional information about MIPS programming using SPIM in CS
254 - Computer organization and assembly language programming and Introduction
to RISC Assembly Language Programming (see example programs).
Documentation and submission: Submit the source text of
the program as a file attachment through Blackboard Vista > CS 502 > MIPS
Programming Assignment.
Mid-term test (20 points)
The Midterm test will be available from Blackboard Vista (the dates will
be posted around midterm). There are 20 multiple choice or short answer
questions that have to be answered within 3 hours.
The test includes the following topics:
Number systems (binary, hexadecimal, two's complement, floating point)
MIPS instruction set architecture
Single cycle datapath and control
Basic principles of the pipelined datapath
Identifying dependencies, and solving pipeline hazards
Memory hierarchy (principles of caches)
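As a quick refresher on the number systems topic in the list above, the sketch below converts between signed integers and their two's complement bit patterns. The function names and the default width of 8 bits are assumptions chosen for the example.

```python
def to_twos_complement(value, bits=8):
    """Return the two's complement bit pattern of value as a string."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} bits")
    # Masking maps a negative value to its unsigned equivalent.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern):
    """Return the signed integer encoded by a two's complement bit string."""
    raw = int(pattern, 2)
    # A leading 1 means the value is negative: subtract 2**bits.
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw


if __name__ == "__main__":
    print(to_twos_complement(-5))            # prints 11111011
    print(from_twos_complement("11111011"))  # prints -5
    print(format(int("11111011", 2), "02X")) # prints FB (hexadecimal)
```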
Building a 16-bit ALU (10 points)
Not available at this time.
Mid-term project (20 points): Designing
a mini MIPS machine
Not available at this time.
Term paper (40 points)
Not available at this time.
Quiz (5 points)
The quiz will be available in Blackboard Vista. There
are 20 multiple choice questions that will have to be answered within 1
Review topics
Categories of computer systems:
Conventional sequential computers
Conventional systems with special purpose components
Multiprocessor systems
Distributed systems
Distributed systems:
Issues: data location and security, load distribution, process migration,
fault tolerance
Types: homogeneous systems, heterogeneous systems
Information Theory
Probability: Bayes's theorem
Measuring information: basic formula (logarithmic scale, halving strategy),
entropy (average information)
Representation: time, amplitude, frequency
Transmission: sampling (Sampling Theorem), digitizing (AD and DA conversion),
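The information measure and entropy listed above can be sketched in a few lines. This is a sketch using the standard base-2 definitions, I(p) = -log2(p) for a single event and H = sum of p * I(p) for the average; the function names are chosen for this example.

```python
import math


def information(p):
    """Information, in bits, carried by an event of probability p."""
    return -math.log2(p)


def entropy(probs):
    """Average information (entropy), in bits, of a distribution."""
    # Events with zero probability contribute nothing to the average.
    return sum(p * information(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Halving strategy: picking 1 of 8 equally likely items takes
    # 3 yes/no questions, i.e. 3 bits.
    print(information(1 / 8))    # prints 3.0
    print(entropy([0.5, 0.5]))   # prints 1.0 (a fair coin)
    print(entropy([0.25] * 4))   # prints 2.0
```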
Circuit switching
Packet switching
Time division multiplexing
Ethernet approach: packets, conflicts and resending
Application area: large quantity of structured data
Types: tables (relational), trees, networks, objects
Data independence: database schema
Relational DBMS
Basics of SQL
Query processing: plans, optimization
Database views
Data integrity
Integrity constraints
Concurrency control
Backup and recovery
Security and access control
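Several of the database topics above (relational tables, basic SQL, views and integrity constraints) can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the student table, its columns and the passing view are hypothetical and exist only for this example.

```python
import sqlite3


def demo():
    """Create a small table and a view, then test an integrity constraint."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Schema with integrity constraints (NOT NULL, CHECK).
    conn.execute(
        "CREATE TABLE student ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " points INTEGER CHECK (points BETWEEN 0 AND 100))"
    )
    conn.executemany(
        "INSERT INTO student (name, points) VALUES (?, ?)",
        [("Ann", 92), ("Bob", 67), ("Eve", 81)],
    )
    # A view is a stored query that can be used like a table.
    conn.execute(
        "CREATE VIEW passing AS"
        " SELECT name, points FROM student WHERE points >= 70"
    )
    rows = conn.execute(
        "SELECT name FROM passing ORDER BY points DESC"
    ).fetchall()
    # The CHECK constraint rejects out-of-range data.
    try:
        conn.execute("INSERT INTO student (name, points) VALUES ('Zed', 150)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    conn.close()
    return rows, rejected


if __name__ == "__main__":
    print(demo())  # prints ([('Ann',), ('Eve',)], True)
```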
Information retrieval
Application area: finding relevant data using irrelevant keys (not well
structured data)
Text document retrieval: inverted index, retrieval queries, retrieval quality
(precision, recall), relevance ranking
Analyzing the Web structure: page rank
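The text retrieval items above (inverted index, retrieval queries, precision and recall) can be sketched as follows. This is a simplified sketch: words are split on whitespace only, queries are treated as AND queries, and the function names are chosen for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    docs = {
        1: "web search engines",
        2: "web mining and data mining",
        3: "relational database search",
    }
    index = build_inverted_index(docs)
    found = search(index, "web")
    print(found)                            # prints {1, 2}
    print(precision_recall(found, {1, 3}))  # prints (0.5, 0.5)
```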